School of Pharmacy Medical University of Sofia

1 School of Pharmacy Medical University of Sofia
Application of machine learning techniques for allergenicity prediction
Ivan Dimitrov
2nd Regional Conference “Supercomputing Applications in Science and Industry”, Rodopi Hotel, Sunny Beach, Bulgaria, September 20-21, 2011

2 Allergen processing pathways
Allergy is a form of hypersensitivity to normally innocuous substances such as dust, pollen, foods or drugs. Allergens are small antigens that commonly provoke an IgE antibody response. Such antigens normally enter the body at very low doses by diffusion across mucosal surfaces and trigger a Th2 response. The allergen-specific Th2 cells drive allergen-specific B cells to produce IgE, which binds to the high-affinity surface receptor, called FcεRI, on mast cells, basophils and activated eosinophils. On activation, these cells release stored mediators, which cause inflammation and tissue damage manifested by different symptoms. Inhalant allergens cause rhinitis, conjunctivitis and asthmatic symptoms, while food allergens lead to abdominal pain, bloating, vomiting and diarrhea. Food allergens rarely cause respiratory reactions and inhalant allergens rarely affect the gut (Rusznak and Davies, 1998, Wiki).
C. M. Hawrylowicz & A. O'Garra, Nature Reviews Immunology 2005,

3 FAO and WHO Codex alimentarius guidelines for evaluating potential allergenicity for novel proteins
Although there is no consensus allergen structure, FAO and WHO have produced Codex Alimentarius guidelines for evaluating the potential allergenicity of any novel protein. According to these guidelines, a query protein is potentially allergenic if, when compared with known allergens, it:
- has an identity of 6 to 8 contiguous amino acids, or
- has > 35% sequence similarity over a window of 80 amino acids.
Codex Principles and Guidelines on Foods Derived from Biotechnology. Rome, Italy: Codex Alimentarius Commission, Joint FAO/WHO Food Standards Programme, Food and Agriculture Organization.
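
The first criterion (a shared stretch of identical contiguous amino acids) can be sketched as a k-mer lookup. This is a minimal illustration, not the procedure of any particular server; the function name and the toy sequences are hypothetical, and the 35% window criterion is not covered because it needs a sequence-alignment step.

```python
def shares_identical_stretch(query: str, allergen: str, k: int = 6) -> bool:
    """Return True if the two sequences share at least one identical
    stretch of k contiguous amino acids (single-letter codes)."""
    if len(query) < k or len(allergen) < k:
        return False
    # All k-mers of the known allergen, stored for fast membership tests.
    allergen_kmers = {allergen[i:i + k] for i in range(len(allergen) - k + 1)}
    # Slide a window of length k along the query.
    return any(query[i:i + k] in allergen_kmers
               for i in range(len(query) - k + 1))


# Toy sequences (hypothetical, not real allergens).
known = "MKTLLVAGHWQERTY"
query_hit = "PPPVAGHWQSSS"    # shares the 6-mer "VAGHWQ" with `known`
query_miss = "PPPVAGHWASSS"   # longest shared stretch is 5 residues

print(shares_identical_stretch(query_hit, known))
print(shares_identical_stretch(query_miss, known))
```

In practice the query would be compared against every sequence in an allergen database, and k would be set between 6 and 8 as the guideline allows.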

4 Bioinformatics approaches to allergen prediction
Two bioinformatics approaches to allergen prediction currently exist.
1. Sequence-alignment search of the query protein. This approach follows the FAO/WHO guidelines and searches for sequence similarity. The Structural Database of Allergenic Proteins (SDAP) and Allermatch contain extensive databases of known allergen proteins and use them as references in a sequence-alignment search of the query protein.
Characteristics:
- High sensitivity (true positives/(true positives + false negatives)).
- Many false positives and therefore low precision (true positives/(true positives + false positives)).
- Discovery of novel antigens is restricted by their lack of similarity to known allergens.
Ivanciuc et al. Nucleic Acids Res. 2003, 31, 359–362
Fiers et al. BMC Bioinformatics 2004, 5, 133

5 Bioinformatics approaches to allergen prediction
2. Identification of conserved allergenicity-related linear motifs. These methods use different techniques for identification, representation and analysis of allergenicity-related motifs:
- Comparing allergens to non-allergens with the MEME motif discovery tool
- Clustering of known allergens, wavelet analysis and hidden Markov model
- Automated Selection of Allergen-Representative Peptides (DASARP)
- Motif search by Support Vector Machines (SVM), MEME/MAST, IgE epitopes and Allergen-Representative Peptides (ARP)
- Iterative pairwise sequence similarity encoding scheme with SVM as the discriminating engine
Both approaches are based on the assumption that allergenicity is a linearly coded property.
Stadler and Stadler FASEB J. 2003, 17,
Saha and Raghava Nucleic Acids Research, 2006, 34,
Li et al. Bioinformatics 2004, 20,
Muh et al. PLoS ONE, 2009, 4 (6), art. no. e5861
Björklund et al. Bioinformatics. 2005, 21, 39–50

6 AIM of the study
Our aim was to create an alignment-free method for in silico identification of allergens based on the main chemical properties of amino acid sequences and to implement it in a web server.
Obstacles:
- The choice of appropriate descriptors to represent the physicochemical properties of amino acid sequences.
- Allergens are proteins of different lengths.

7 The z-scales
The principal properties of the amino acids were represented by z descriptors, originally derived by Hellberg et al. to describe amino acid hydrophobicity (z1), molecular size (z2) and polarity (z3). These scales were derived by principal component analysis (PCA) of a data matrix consisting of 29 physicochemical variables, such as molecular weight, pKa's, 13C NMR shifts, etc. The z-scales reflect the most important properties of amino acids and are therefore often referred to as the "principal properties" of amino acids. With the three z-scales it is possible to numerically quantify the structural variation within a series of related peptides by arranging the z-scales according to the amino acid sequence:
…Phe – Arg – Trp… → z1 z2 z3 – z1 z2 z3 – z1 z2 z3
Hellberg et al. J. Med. Chem. 1987; 30,
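
Arranging the z-scales along a sequence amounts to a table lookup per residue. The sketch below shows the idea only: the numbers in `Z_SCALES` are placeholders invented for illustration, not the published Hellberg values, and the table covers just the three residues of the slide's example.

```python
# PLACEHOLDER descriptor values for illustration only. Replace them with the
# published (z1, z2, z3) values from Hellberg et al. and extend the table to
# all 20 amino acids before any real use.
Z_SCALES = {
    "F": (-1.0, 0.5, 0.1),   # Phe
    "R": (1.0, 1.0, -1.0),   # Arg
    "W": (-1.5, 1.5, 0.3),   # Trp
}


def encode(sequence: str) -> list:
    """Map each residue (single-letter code) to its (z1, z2, z3) triple.

    Raises KeyError for residues missing from the table.
    """
    return [Z_SCALES[residue] for residue in sequence]


encoded = encode("FRW")
print(len(encoded), "residues,", len(encoded[0]), "descriptors each")
```

The encoded length still depends on the sequence length, which is why a further transformation is needed before proteins of different sizes can be compared.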

8 ACC transformation
The auto cross covariance (ACC) transformation turns protein sequences into uniform equal-length vectors. ACC is a protein sequence mining method developed by Wold et al., which has been applied to quantitative structure-activity relationship (QSAR) studies of peptides of different lengths and to protein classification. The ACC transformation accounts for neighbour effects, i.e. the lack of independence between different sequence positions, through the lag variable.
Auto-covariance:
\[ ACC_{jj}(l) = \frac{1}{n-l} \sum_{i=1}^{n-l} z_{j,i} \, z_{j,i+l} \]
Cross-covariance:
\[ ACC_{jk}(l) = \frac{1}{n-l} \sum_{i=1}^{n-l} z_{j,i} \, z_{k,i+l}, \quad j \neq k \]
In the equations, indices j and k refer to the z-descriptors (j, k = 1, 2, 3), n is the number of amino acids in the sequence, index i is the amino acid position (i = 1, 2, …, n) and l is the lag (l = 1, 2, …, L). In our study short lags (L = 5) were chosen, as only the influence of the close amino acid proximity was investigated.
Example: for the six-residue sequence Phe – Arg – Trp – Phe – Arg – Trp, encoded as z1 z2 z3 for each residue, ACC11(1) and ACC13(1) are each a sum of five products divided by 5.
Wold et al. Anal. Chim. Acta 1993, 277:
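
A minimal sketch of the transformation follows. It assumes each ACC term is the sum of products of descriptor j at position i and descriptor k at position i + lag, divided by n − lag, which is consistent with the "/5" in the six-residue, lag-1 example. The function name and the ordering of the output values are choices made here for illustration.

```python
def acc_transform(z, max_lag=5):
    """Turn a per-residue descriptor list into a fixed-length ACC vector.

    z: list of (z1, z2, z3) tuples, one per residue.
    Returns 3 * 3 * max_lag values, ordered by lag, then j, then k
    (this ordering is an assumption of the sketch).
    """
    n = len(z)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    n_desc = len(z[0])
    features = []
    for lag in range(1, max_lag + 1):
        for j in range(n_desc):          # descriptor at position i
            for k in range(n_desc):      # descriptor at position i + lag
                total = sum(z[i][j] * z[i + lag][k] for i in range(n - lag))
                features.append(total / (n - lag))
    return features


# Toy descriptor values (hypothetical) for a six-residue sequence.
toy = [(1, 0, 2), (2, 1, 0), (3, 0, 1), (1, 1, 1), (0, 2, 2), (2, 0, 0)]
vector = acc_transform(toy)
print(len(vector))   # 3 descriptors x 3 descriptors x 5 lags
```

Whatever the sequence length, the output has the same number of entries, so proteins of different sizes become directly comparable.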

9 Preliminary study
Data:
- 595 food allergens from the CSL (Central Science Laboratory) allergen database
- 595 non-allergens from the same species, collected from the NCBI database
- Training set: 475 food allergens and 475 non-allergens, formed with equal representation of all species in the initial set
- Test set (external validation): 120 food allergens and 120 non-allergens
The amino acid sequences were represented by z descriptors, and ACC transformation produced a matrix of 45 variables (3² × 5) and 950 observations. We applied different machine learning methods to that matrix and validated the resulting models on the external test set:
- PLS discriminant analysis, performed with the SIMCA software
- k nearest neighbours, performed by a Python script based on a BioPython module
- Logistic regression, Naïve Bayes and decision tree algorithms, performed with the Orange visualization and analysis tool
The results were evaluated using the sensitivity, specificity and accuracy of each method.

10 Results from preliminary study
TP – true positive, FP – false positive, TN – true negative, FN – false negative.
Comparison of the methods shows the best results for k nearest neighbours at K = 5. All of the methods show some imbalance between sensitivity and specificity, but for PLS-DA it is significant. The most homogeneous results with respect to specificity and sensitivity are observed for the k nearest neighbours algorithm. The difference between specificity and sensitivity for all of the methods suggests that the training set needs further improvement.
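
The three evaluation measures follow directly from the four counts. The sketch below uses the standard definitions; the counts in the example are hypothetical and are not results from this study.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int):
    """Return (sensitivity, specificity, accuracy) from confusion counts."""
    sensitivity = tp / (tp + fn)                 # allergens correctly recognised
    specificity = tn / (tn + fp)                 # non-allergens correctly recognised
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # all correct predictions
    return sensitivity, specificity, accuracy


# Hypothetical counts on a balanced test set of 120 + 120 proteins.
sens, spec, acc = classification_metrics(tp=100, fp=30, tn=90, fn=20)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} accuracy={acc:.3f}")
```

On a balanced test set the accuracy is the mean of sensitivity and specificity, so a large gap between the two can hide behind a reasonable accuracy.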

11 Web servers on the test set
Servers tested on the test set of 120 food allergens and 120 non-allergens:
- AlgPred: SVM with single amino acid composition; SVM with dipeptide composition
- Evaller
- APPEL
- Allerhunter
We tested the performance of the available web servers on our test set and compared them to our best result, kNN (K = 5). All of the servers use support vector machines (SVM) as the machine learning method and different methods for peptide representation. The comparison shows an imbalance between sensitivity and specificity for almost all of the servers. The servers with the most homogeneous values for specificity and sensitivity are also the ones with the best performance. The highest results among the servers are achieved by Allerhunter: 87% sensitivity, 92% specificity and 89.9% accuracy.
Saha and Raghava Nucleic Acids Research, 2006, 34,
Barrio et al., Nucleic Acids Research 2007, 35,
Muh et al. PLoS ONE, 2009, 4 (6), art. no. e5861

12 Conclusions from the preliminary study
1. The model developed by the k nearest neighbours method shows the best performance on the test set compared to the other methods. It has a good balance between specificity and sensitivity, and the highest accuracy. kNN was used further in the study.
2. The Allerhunter server is the best performing among the known servers for allergen prediction. kNN needs some more improvement.
3. A great imbalance exists between sensitivity and specificity for almost all servers. This indicates that the dataset needs some improvement too.

13 The kNN algorithm
Training set: 475 allergens and 475 non-allergens → ACC transformation of z descriptors → matrix of 45 variables (3² × 5) and 950 observations.
Unknown protein → ACC transformation of z descriptors → vector with 45 variables (3² × 5).
Steps:
1. Calculate the Euclidean distance between the vector and each observation.
2. Sort the distances in ascending order.
3. Determine the k nearest neighbours.
4. Determine the class of the unknown protein according to the majority of its nearest neighbours.
The protein sequences of the training set, containing 475 allergens and 475 non-allergens, are represented by vectors of z descriptors. These vectors are subjected to ACC transformation, which turns the training set into a matrix with 45 variables and 950 observations. Every protein from the test set is represented by z descriptors and transformed into a vector of 45 ACC values. The Euclidean distance between the vector of the unknown protein and each of the 950 observations is calculated, and the obtained values are sorted in ascending order. The K nearest neighbours are the K observations with the smallest distances, and the class of the tested protein is the class of the majority of its neighbours.
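
The classification step can be sketched in a few lines. This is an illustration of the general k nearest neighbours procedure described here, not the authors' script; the function name, labels and the two-dimensional toy vectors are hypothetical (the study uses 45-dimensional ACC vectors).

```python
import math
from collections import Counter


def knn_predict(train_vectors, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training vectors."""
    # Euclidean distance from the query to every training observation,
    # sorted in ascending order.
    distances = sorted(
        (math.dist(vector, query), label)
        for vector, label in zip(train_vectors, train_labels)
    )
    # Labels of the k closest observations.
    nearest = [label for _, label in distances[:k]]
    # Majority class among the neighbours.
    return Counter(nearest).most_common(1)[0][0]


# Toy training data (hypothetical).
vectors = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["allergen"] * 3 + ["non-allergen"] * 3

print(knn_predict(vectors, labels, (0.5, 0.5), k=3))
print(knn_predict(vectors, labels, (5.5, 5.5), k=3))
```

With two classes, an odd k avoids tied votes, which is one reason odd values of K are the usual choice.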

14 Next: Extend the data sets
Sources of allergens: CSL allergen database, FARRP allergen database, SDAP database, ADFS database.
Allergen sets: 684 food allergens; 1157 inhalant allergens; 553 toxin, venom or salivary allergens.
Workflow: allergen species → NCBI database → local database of proteins from allergen species → BLAST search against all allergens → non-allergen sets:
- 684 non-allergens of food origin
- 1157 non-allergens of inhalant origin
- 553 non-allergens from species with toxin, venom or salivary allergens
We extracted data for food and inhalant allergens from four databases and used the allergen species from the resulting sets to build a local database with protein records of all allergen species (maximum 1000 records per species). From this local database we ran BLAST searches of the proteins against the collected set of all allergens (food and inhalant) to form a set of non-allergens with no sequence similarity to allergens but from the same species. The result was two data sets: 684 food allergens with 684 non-allergens from the same species, and 1157 inhalant allergens with the same number of non-allergens from the same species.

15 Next: kNN optimization
Data: 684 food allergens and 684 non-allergens.
- Training set: 528 allergens and 528 non-allergens
- Test set (external validation): 156 allergens and 156 non-allergens
We used the set of food allergens and non-allergens to optimize the kNN algorithm, which showed the best performance among all the machine learning methods. The food set was divided into a training set of 528 allergens with corresponding non-allergens and a test set of 156 allergens with corresponding non-allergens. kNN models with different K values were trained and tested to find the best K value for that set. Increasing K led to a slight increase in specificity, but sensitivity decreased significantly. As a result, accuracy decreased as K increased. The best accuracy was achieved for K = 3, although the most homogeneous results with respect to all three parameters were achieved for K = 5 and K = 7.
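
The search for the best K can be sketched as a loop that scores each candidate on the held-out test set. The helper names, the candidate K values in the example and the one-dimensional toy data are assumptions made for illustration; the test set must contain both classes for the ratios to be defined.

```python
import math
from collections import Counter


def predict(train, query, k):
    """train: list of (vector, label). Majority vote among the k closest."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def sweep_k(train, test, ks):
    """Return {k: (sensitivity, specificity, accuracy)} on the test set."""
    results = {}
    for k in ks:
        tp = fp = tn = fn = 0
        for vector, label in test:
            guess = predict(train, vector, k)
            if label == "allergen":
                tp += guess == "allergen"
                fn += guess != "allergen"
            else:
                tn += guess != "allergen"
                fp += guess == "allergen"
        results[k] = (tp / (tp + fn), tn / (tn + fp), (tp + tn) / len(test))
    return results


# Toy data (hypothetical): two well-separated classes on a line.
train = [((0,), "allergen"), ((1,), "allergen"), ((2,), "allergen"),
         ((10,), "non-allergen"), ((11,), "non-allergen"), ((12,), "non-allergen")]
test = [((0.5,), "allergen"), ((11.5,), "non-allergen")]

for k, scores in sweep_k(train, test, ks=(1, 3, 5)).items():
    print(k, scores)
```

Comparing the three scores per K, rather than accuracy alone, makes the trade-off between sensitivity and specificity visible.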

16 kNN models
Food set: 684 food allergens and 684 non-allergens.
- Training set: 528 allergens and 528 non-allergens
- Test set: 156 allergens and 156 non-allergens
Inhalant set: 1157 inhalant allergens and 1157 non-allergens.
- Training set: 933 allergens and 933 non-allergens
- Test set: 224 allergens and 224 non-allergens
Each of the sets of food and inhalant allergens and non-allergens was divided into a training set and a test set. The training sets of food and inhalant allergens were used to create kNN models with K = 3, since it had the best performance during the optimisation step. The models for food and inhalant allergens were validated with their respective test sets and with the whole set of inhalant and food allergens, respectively.

17 kNN models
The results show that while the test set has no significant effect on specificity, sensitivity clearly depends on it. Both models show high specificity, i.e. both recognize non-allergens correctly (almost 90%). The lower sensitivity when the models are tested on sets consisting of a different kind of allergen agrees with the data in the literature that food allergens rarely cause respiratory reactions and inhalant allergens rarely affect the gut. The highest results for all three parameters were achieved by the kNN model trained with food allergens and validated with the food test set. The model based on the aggregated training set shows good performance, and its values for sensitivity, specificity and accuracy are very close.

18 AllerTOP web tool for allergenicity prediction
Training set: 1952 food, inhalant and other allergens and 1952 non-allergens → ACC transformation of z descriptors → kNN model → external validation.
We implemented the kNN model based on the aggregated training set of food, inhalant and other allergens in AllerTOP, a web tool for online prediction of allergens. The server takes a protein sequence in single-letter format, transforms it into a vector of 45 ACC values and returns the output of the model for the tested protein.

19 Server performance on the united test set
United test set of 441 food and inhalant allergens and 441 non-allergens.
The performance of the servers on the aggregated test set of 441 allergens and 441 non-allergens is presented. Unfortunately, two of the servers from the preliminary study, APPEL and Evaller, were not available during the recent study. The highest results were achieved by Allerhunter and by the AlgPred server's allergen-representative peptide method. The former even reached 100% specificity. The kNN model based on the aggregated training set with 1952 allergens shows very stable results for specificity and sensitivity, but this is not enough to reach the highest scores.
The results for the Allerhunter server were obtained with a smaller test set because it cannot process short sequences (<21 amino acids).

20 Conclusions
1. An alignment-free method for in silico prediction of allergens based on the main physicochemical properties of proteins was developed.
2. The method uses z descriptors for representation of amino acids in the protein sequences and ACC transformation for conversion of proteins into uniform vectors.
3. The k nearest neighbours method showed the best performance among the classification algorithms tested in the study: PLS discriminant analysis, logistic regression, Naïve Bayes and decision tree.
4. The kNN algorithm was optimized and its performance was compared to the freely available web servers for prediction of allergens.
5. The kNN algorithm was implemented on a web server, freely available on:

21 Drug Design Group School of Pharmacy Medical University of Sofia
Irini Doytchinova, Ivan Dimitrov, Mariyana Atanasova, Panaiot Garnev
Acknowledgements: Darren R. Flower, Aston University, Birmingham, UK
Funding: National Research Fund, Ministry of Education and Science, Bulgaria, Grant 02-1/2009

