Support Vector Machine-based Transmembrane Protein Topology Prediction Tim Nugent
Alpha-helical Transmembrane Proteins Transmembrane proteins fulfil many critical cellular functions. Comprise about 30% of the human proteome. Composed of hydrophobic, membrane-spanning alpha-helices, connected with loop regions. Poorly represented in structural databases. Predicting their structure and topology is therefore an important challenge for bioinformatics.
Transmembrane Protein Topology Topology of a transmembrane protein describes which regions are membrane-spanning and which are 'inside' or 'outside' (e.g. cytoplasmic/extracellular or cytoplasmic/lumenal). Number and position of TM helices. Position of the N-terminal.
Early Hydrophobicity-based Approaches To generate data for a plot, the protein sequence is scanned with a moving window of fixed size (in residues). At each position, the mean hydrophobic index of the amino acids within the window is calculated, and that value is plotted at the midpoint of the window. Example (Aquaporin): KGVWTQAFWKA V TAEFLAMLIFVLLSVGSTINWGGSEN
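The moving-window calculation described above can be sketched as follows. This is a minimal illustration, not the code behind the slide: the Kyte-Doolittle scale and the 19-residue window are assumptions (the transcript names neither), and `hydropathy_profile` is a hypothetical helper name. The example sequence is the Aquaporin fragment from the slide with its spaces removed.

```python
# Kyte-Doolittle hydropathy index (assumed scale; the slide does not name one).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Return (midpoint_index, mean_hydropathy) for every full window.

    `window` must be odd so that each window has a well-defined midpoint.
    The default of 19 is an illustrative assumption.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    profile = []
    for mid in range(half, len(seq) - half):
        residues = seq[mid - half: mid + half + 1]
        mean = sum(KD[aa] for aa in residues) / window
        profile.append((mid, mean))
    return profile


# Aquaporin fragment from the slide, spaces removed.
fragment = "KGVWTQAFWKAVTAEFLAMLIFVLLSVGSTINWGGSEN"
profile = hydropathy_profile(fragment, window=19)
peak_position, peak_value = max(profile, key=lambda item: item[1])
```

Peaks in the profile mark candidate membrane-spanning segments; a threshold on the mean value would then be needed to call a helix, and that threshold depends on the scale and window chosen.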
Discriminating between Inside and Outside Loops Hydrophobic: Val, Phe, Ile, Leu, Met. Positive: Lys, Arg, His. Cytoplasmic loops are enriched in positively charged residues: the 'positive-inside rule' of von Heijne
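As a rough illustration of how the positive-inside rule can orient a topology, the sketch below counts positively charged residues in alternating loops and assigns the side with more of them to the cytoplasm. The function names are hypothetical, the residue set follows the slide (Lys, Arg, His), and the tie-breaking choice is an assumption.

```python
# Positively charged residues as listed on the slide: Lys, Arg, His.
POSITIVE = set("KRH")


def positive_count(segment):
    """Number of positively charged residues in a loop sequence."""
    return sum(residue in POSITIVE for residue in segment)


def orient_first_loop(loops):
    """Guess whether the first loop is 'in' (cytoplasmic) or 'out'.

    `loops` lists the loop sequences in order from the N-terminus.
    Consecutive loops lie on opposite sides of the membrane, so loops
    at even indices share one side and loops at odd indices the other.
    Ties are resolved as 'in' (an arbitrary assumption).
    """
    even_side = sum(positive_count(loop) for loop in loops[0::2])
    odd_side = sum(positive_count(loop) for loop in loops[1::2])
    return "in" if even_side >= odd_side else "out"


print(orient_first_loop(["MKRK", "GSS", "RRK"]))  # first and third loops are rich in K/R
```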
Machine Learning-based Approaches
Using Support Vector Machines for Topology Prediction Recently, more advanced methods using machine learning algorithms such as hidden Markov models (e.g. TMHMM, PHOBIUS) and neural networks (MEMSAT3) have been developed. They have achieved significant improvements in prediction accuracy (~80%). However, none of the top-scoring methods use SVMs. While hidden Markov models and neural networks may have multiple outputs, SVMs are binary classifiers. To deal with TM topology prediction, multiple SVMs have to be combined, e.g. TM helix / loop; inside loop / outside loop; signal peptide / ¬signal peptide; re-entrant loop / ¬re-entrant loop.
Assembling a Novel Data Set of Transmembrane Proteins To study and predict features of transmembrane (TM) proteins, a high-quality data set containing sequences with experimentally confirmed TM regions is essential for both training and validation. Based on the Möller set and the MPTOPO database. Novel TM sequences parsed from SWISS-PROT and BLASTed against the PDB. Fragments, chain breaks, colicins, venoms etc. removed. Homology reduced at 40% sequence identity. Topologies determined by OPM or PDB_TM. Since PDB structures of TM proteins contain no lipid, theoretical approaches are used to predict the position of the membrane relative to the structure, and thus the TM helix boundaries. OPM uses water-lipid transfer energy minimisation. PDB_TM uses hydrophobicity/structural feature analysis.
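The homology-reduction step can be pictured as a greedy filter: keep a sequence only if it is no more than 40% identical to every sequence already kept. The sketch below is an illustration under stated assumptions, not the procedure used for this data set. In particular, `naive_identity` is a hypothetical stand-in that compares positions without alignment; a real pipeline would take identities from an alignment tool.

```python
def naive_identity(a, b):
    """Fraction of identical positions over the shorter sequence.

    Hypothetical stand-in: no alignment is performed, so this is only
    meaningful for toy examples.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def homology_reduce(sequences, threshold=0.40, identity=naive_identity):
    """Greedily keep sequences whose identity to all kept ones is <= threshold."""
    kept = []
    for seq in sequences:
        if all(identity(seq, other) <= threshold for other in kept):
            kept.append(seq)
    return kept


toy = ["AAAAAAAAAA", "AAAAAAAAAC", "CCCCCCCCCC"]
reduced = homology_reduce(toy)  # the second sequence is 90% identical to the first
```

The result of a greedy filter depends on input order, so sorting the sequences first (for example by structure quality) is a common design choice.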
Assembling a Novel Data Set of Transmembrane Proteins Theoretical membrane placement on to the Mechanosensitive channel protein MscS crystal structure (PDB code 2oau) by OPM (left) and PDB_TM (right). The membrane region is between the red and blue bars.
Re-entrant Helices Re-entrant helices in Aquaporin Z (left) from Escherichia coli (PDB code 1rc2) and Potassium channel (right) from Bacillus cereus (PDB code 2ahy) marked with black arrows.
MySQL Table Schema
Data Set Composition
Support Vector Machine Training Data set of 131 non-redundant protein sequences. Jackknife cross-validation: sequences with >25% sequence identity removed from training sets. Signal peptide SVM: 10-fold cross-validation plus additional data from the Phobius set and SWISS-PROT. PSI-BLAST profiles vs UniRef90, with an E-value threshold for inclusion. Profiles normalised by Z-score. Residue sliding window. Transduction. Window size, kernel choice and parameters optimised using the Matthews correlation coefficient (MCC).
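For reference, the Matthews correlation coefficient of a binary classifier is computed from the confusion-matrix counts as MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). A minimal sketch follows; the function name `mcc` and the convention of returning 0 when the denominator is zero are choices made here, not details from the talk.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns a value in [-1, 1]: 1 is perfect prediction, 0 is no better
    than chance, -1 is total disagreement. When any marginal total is
    zero the coefficient is undefined; 0.0 is returned by convention.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


perfect = mcc(tp=10, tn=10, fp=0, fn=0)
chance = mcc(tp=5, tn=5, fp=5, fn=5)
```

Unlike raw accuracy, MCC stays informative when the two classes are unbalanced, which is typical for per-residue labels such as signal peptide versus non-signal peptide.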
Per Residue SVM Prediction Accuracy
Dynamic Programming Simplified version of the original MEMSAT algorithm, treating TM helices as discrete units rather than separating them into inside, outside and middle components. Re-entrant helix and signal peptide states were added. Residues were therefore predicted to lie in one of five different topological regions: inside loop, outside loop, TM helix, re-entrant helix and signal peptide. For evaluating signal peptide preference, positive signal peptide scores for residues up to position 30 in a target sequence were added to the outside loop score and subtracted from the inside loop score, in order to direct prediction towards a non-cytoplasmic amino terminus. The value was also scaled by a factor of 10 and subtracted from the TM helix SVM score to prevent TM helix prediction. For the same reason, positive re-entrant helix scores were scaled by a factor of 10 and subtracted from the TM helix SVM score.
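To give a feel for how per-residue SVM scores can be turned into a consistent topology, here is a heavily simplified dynamic programming sketch. It is not the MEMSAT algorithm or the method from the talk: it uses only three regions (inside loop, TM helix, outside loop), omits the signal peptide and re-entrant states, imposes no helix length limits, and the scoring scheme and all names are assumptions made for illustration. `tm_scores` is positive for helix-like residues and `io_scores` is positive for inside-like residues.

```python
NEG_INF = float("-inf")

# States: "i" inside loop, "o" outside loop,
# "M" helix crossing inside -> outside, "W" helix crossing outside -> inside.
STATES = ("i", "M", "o", "W")

# Allowed predecessors: a helix must be followed by a loop on the far side.
PREDECESSORS = {"i": ("i", "W"), "M": ("M", "i"), "o": ("o", "M"), "W": ("W", "o")}


def emission(state, tm, io):
    """Score for placing one residue in `state` (assumed scoring scheme)."""
    if state in ("M", "W"):
        return tm
    side = 1 if state == "i" else -1
    return -tm + side * io


def predict_topology(tm_scores, io_scores):
    """Return the best-scoring label string using 'i', 'o' and 'M'."""
    n = len(tm_scores)
    if n == 0:
        return ""
    # The sequence must start (and end) in a loop state.
    score = {s: (emission(s, tm_scores[0], io_scores[0]) if s in ("i", "o") else NEG_INF)
             for s in STATES}
    backpointers = []
    for k in range(1, n):
        new_score, pointer = {}, {}
        for s in STATES:
            best_prev = max(PREDECESSORS[s], key=lambda p: score[p])
            new_score[s] = score[best_prev] + emission(s, tm_scores[k], io_scores[k])
            pointer[s] = best_prev
        score = new_score
        backpointers.append(pointer)
    state = max(("i", "o"), key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return "".join(reversed(path)).replace("W", "M")


tm = [-1] * 3 + [2] * 5 + [-1] * 3
io = [1] * 3 + [0] * 5 + [-1] * 3
print(predict_topology(tm, io))  # iiiMMMMMooo
```

Splitting the helix into two directional states is what forces loops on either side of a helix to lie on opposite sides of the membrane; the score adjustments described above for signal peptides and re-entrant helices would enter through the emission scores.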
Overall Prediction Accuracy Benchmark results for the SVM-based method ('TMSVM') against a selection of leading topology predictors. 'Correct signal peptide' and 'correct re-entrant helix' refer to correct topology prediction for proteins containing these features. TMSVM was able to detect signal peptides with 92% accuracy, and re-entrant helices with 39% accuracy. No false positives of either class were predicted. OCTOPUS results were not cross-validated and are therefore likely to be overestimated, as there is considerable overlap between test and training sets. Tested against the Möller (low-resolution) data set, TMSVM scores 77%, the same as MEMSAT3.
Glycerol uptake facilitator
ABC transporter BtuCD
Discriminating between TM and Globular Proteins For SVM training, we used 416 randomly chosen proteins from the MEMSAT3 set, which consists of 2685 non-redundant chains from globular proteins of known structure, combined with our novel set of 131 TM proteins. The remaining 2269 sequences were used as test cases. PSI-BLAST profiles were generated for all sequences and 10-fold cross-validation was used to assess performance, again removing sequences from the training fold with greater than 25% sequence identity to any sequence in the test fold. Window size = 33, Kernel = RBF, MCC = 0.78
Whole Genome Analysis
Conclusions Novel SVM-based approach predicts correct topology with 88% accuracy, 9% higher than the next best method, OCTOPUS. Incorporates signal peptide and re-entrant helix prediction. Signal peptide-containing proteins correctly predicted with 92% accuracy. Re-entrant helix-containing proteins correctly predicted with 55% accuracy (room for improvement). Good TM/globular protein discrimination; combined with SP prediction, highly suited to whole genome analysis. Further work: SVM to predict amphipathic/pore-forming helices.