Using Support Vector Machines for transmembrane protein topology prediction Tim Nugent
Alpha-helical Transmembrane Proteins Transmembrane proteins fulfil many critical cellular functions. They comprise about 30% of the human proteome. Composed of hydrophobic, membrane-spanning alpha-helices connected by loop regions. Poorly represented in structural databases. Predicting their structure and topology is therefore an important challenge for bioinformatics.
Transmembrane Protein Topology The topology of a transmembrane protein describes which portions of the amino-acid sequence lie within the plane of the surrounding lipid bilayer and which portions protrude into the watery environment on either side. It is defined by the regions of the polypeptide chain that span the membrane and by the position of the N-terminus (inside or outside).
Identification of Transmembrane Regions To generate data for a plot, the protein sequence is scanned with a moving window of a fixed size (in residues). At each position, the mean hydrophobic index of the amino acids within the window is calculated and that value is plotted at the midpoint of the window. Aquaporin V KGVWTQAFWKAVTAEFLAMLIFVLLSVGSTINWGGSEN
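The sliding-window calculation described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: it assumes the Kyte-Doolittle hydropathy scale, and the default window of 19 residues is an assumption, since the window size is missing from the transcript. The function name `hydropathy_profile` is hypothetical.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Return (midpoint, mean hydropathy) pairs for each window position.

    Midpoints are 1-based residue numbers. The default window size is an
    assumption; the slide does not preserve the value that was used.
    """
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    half = window // 2
    profile = []
    for start in range(len(sequence) - window + 1):
        residues = sequence[start:start + window]
        mean = sum(KYTE_DOOLITTLE[aa] for aa in residues) / window
        profile.append((start + half + 1, mean))
    return profile


# Sequence fragment shown on the slide.
fragment = "KGVWTQAFWKAVTAEFLAMLIFVLLSVGSTINWGGSEN"
profile = hydropathy_profile(fragment)
peak_position, peak_score = max(profile, key=lambda item: item[1])
```

Plotting the second element of each pair against the first gives the hydropathy plot; a sustained run of high values suggests a candidate membrane-spanning segment.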
Discriminating between Inside and Outside Loops Hydrophobic residues: Val, Phe, Ile, Leu, Met. Positively charged residues: Lys, Arg, His. Cytoplasmic loops are enriched in positively charged residues: the 'positive-inside rule' of von Heijne.
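As a rough illustration of how the positive-inside rule can be applied, the sketch below counts positively charged residues in the loops on each side of the membrane and assigns the richer side to the cytoplasm. It assumes the loops are already known and strictly alternate sides (no re-entrant loops), and it uses the slide's list of positive residues (Lys, Arg, His). The function names are hypothetical.

```python
POSITIVE = set("KRH")  # Lys, Arg, His, as listed on the slide


def count_positive(segment):
    """Count positively charged residues in a loop sequence."""
    return sum(1 for aa in segment if aa in POSITIVE)


def predict_n_terminus(loops):
    """Guess whether the N-terminus is inside (cytoplasmic) or outside.

    `loops` holds the loop sequences in order from the N-terminus.
    Assumption: consecutive loops lie on opposite sides of the membrane,
    so loops 0, 2, 4, ... share a side with the N-terminus.
    """
    n_side = sum(count_positive(loop) for loop in loops[0::2])
    other_side = sum(count_positive(loop) for loop in loops[1::2])
    if n_side > other_side:
        return "inside"
    if n_side < other_side:
        return "outside"
    return "undetermined"
```

For example, `predict_n_terminus(["MKRK", "GSTN", "RRKL"])` returns `"inside"`, because all the positive charge sits in the loops on the N-terminal side.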
Using Evolutionary Information (1) PSI-BLAST takes a single protein sequence as input and compares it to a protein database. (2) The program constructs a multiple alignment, and then a profile, from any significant local alignments found. (3) The profile is compared to the protein database, again seeking local alignments. (4) PSI-BLAST estimates the statistical significance of the local alignments found. (5) Finally, PSI-BLAST iterates, by returning to step (2), an arbitrary number of times or until convergence.
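In practice, a profile of this kind is usually produced by running the `psiblast` program from NCBI BLAST+. The sketch below only assembles such a command line. The file names, database name, iteration count and inclusion threshold are illustrative assumptions and are not taken from the presentation, and the helper name `build_psiblast_command` is hypothetical.

```python
def build_psiblast_command(query_fasta, database, pssm_out,
                           iterations=3, inclusion_evalue=0.001):
    """Assemble a psiblast command that writes an ASCII profile (PSSM).

    All default values here are illustrative assumptions.
    """
    return [
        "psiblast",
        "-query", query_fasta,            # single protein sequence (FASTA)
        "-db", database,                  # protein database to search
        "-num_iterations", str(iterations),
        "-inclusion_ethresh", str(inclusion_evalue),
        "-out_ascii_pssm", pssm_out,      # profile from the last iteration
    ]


command = build_psiblast_command("query.fasta", "nr", "query.pssm")
# To execute (requires BLAST+ and a formatted database):
#   import subprocess; subprocess.run(command, check=True)
```

The resulting profile has one row per residue and one column per amino acid, which is the form of input a sliding-window classifier can consume.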
Using Support Vector Machines for Topology prediction Earlier approaches have relied on physicochemical properties such as hydrophobicity to identify transmembrane helices (e.g. Kyte-Doolittle). More recently, advanced methods using machine learning algorithms such as hidden Markov models (e.g. TMHMM, PHOBIUS) and neural networks (e.g. MEMSAT3) have been developed. They have achieved significant improvements in prediction accuracy (~80%). However, none of the top-scoring methods use SVMs. While hidden Markov models and neural networks may have multiple outputs, SVMs are binary classifiers. To deal with TM topology prediction, multiple SVMs will have to be combined, e.g. TM helix / Loop; Inside Loop / Outside Loop; Signal Peptide / TM helix; Re-entrant Loop / TM helix.
Helix / Loop SVM Prediction Accuracy TM helix / Loop SVM: Database of 135 non-redundant protein sequences. Jackknife cross-validation. PSI-BLAST profiles, normalised by Z-score. 33-residue sliding window. Radial Basis Function kernel: gamma = 0.09, C = 0.8. SVM Matthews Correlation Coefficient (MCC) = 0.82 (TP = 9129, FP = 1351, TN = 22140, FN = 1320). Kyte-Doolittle MCC: 0.66. MEMSAT3 MCC: 0.76.
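The reported Matthews Correlation Coefficient can be checked directly from the per-residue confusion counts on this slide. The short sketch below uses the standard MCC formula; the function name is hypothetical, and only the four counts come from the presentation.

```python
import math


def matthews_corrcoef(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    # By convention, return 0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0


# Counts reported on the slide for the TM helix / Loop SVM.
mcc = matthews_corrcoef(tp=9129, fp=1351, tn=22140, fn=1320)
print(round(mcc, 2))  # 0.82
```

The counts reproduce the quoted value of 0.82, so the figures on the slide are internally consistent.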
SVM Results – Particulate Methane Monooxygenase subunit C
SVM Results – Cytochrome b6f subunit A
Further work Expand training set: additional sequences where the TMH are known but the topology is not can be used to train the Helix/Loop classifier. Parameter optimisation: window size, kernel type. Transduction. Signal peptide SVM. Re-entrant loop SVM. Combine SVM raw scores/probabilities into a topology.
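The slides list combining SVM outputs as future work and do not say how it would be done. Purely as an illustration of the idea, the sketch below applies a naive per-residue rule to two hypothetical signed decision values (helix versus loop, and inside versus outside). The function name, the zero thresholds and the label alphabet are all assumptions, not the presenter's method.

```python
def combine_scores(helix_scores, inside_scores):
    """Naively merge two per-residue SVM outputs into a topology string.

    helix_scores:  > 0 means TM helix, otherwise loop (hypothetical input)
    inside_scores: > 0 means inside loop, otherwise outside loop
    Returns one label per residue: 'M' (membrane), 'i' (inside), 'o' (outside).
    """
    if len(helix_scores) != len(inside_scores):
        raise ValueError("score lists must have the same length")
    labels = []
    for helix, inside in zip(helix_scores, inside_scores):
        if helix > 0:
            labels.append("M")
        elif inside > 0:
            labels.append("i")
        else:
            labels.append("o")
    return "".join(labels)


topology = combine_scores([-1.0, 0.5, 0.7, -0.2], [0.9, 0.1, -0.3, -0.8])
```

A per-residue rule like this can produce inconsistent results, such as loops that do not alternate sides, so a practical method would presumably need an additional step that enforces a valid topology over the whole sequence.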