Transmembrane Protein Prediction Project Presentation CMPUT 606.

Overview
- Transmembrane (TM) protein: a protein associated with the plasma membrane; “a protein that has domains exposed on both sides of the membrane” [Genes VII].
- Some TM proteins that span the lipid layer several times form a hydrophilic channel that lets various ions and molecules pass through the plasma membrane.

Transmembrane Proteins

Transmembrane Segments

Ion Channels

Transmembrane Domains

Data Sets: Brief Description
- DB-TMR: Database of TM segments (not in fasta format). After translation into fasta: DB-TMR40672.fasta (TM segments flanked by 5 amino acids at each end of the segment) and DB-TMR40672onlytm.fasta (only TM segments). Each file contains 40672 protein sequences.
- PDB: Database of 3D structures in PDB format. After translation into fasta and removal of all nucleotide sequences: pdb61042.fasta. The file PDBseqsnontm.fasta, which contains 645 globular proteins, constitutes a negative test set. Running the TMHMM predictor on the protein chains extracted from PDB produced a file with the prediction for each of the 61042 sequences, outputTMHMMonPDB61042.txt. Of the 61042 sequences, 1294 were predicted to be TM. The sequences predicted as TM are stored in fasta format in seqOutputTMHMM1294.fasta (every entry is prefixed with “sp|” to mark it as TM for prediction and testing purposes).
- PDB_TM: Database of TM proteins in XML format. The site provides lists of PDB IDs representing TM proteins (test_mem.txt) and globular proteins (globpdb.txt). From these initial lists, the following fasta files are generated: tm.fasta with 916 TM protein chains, nontm.fasta with 900 protein chains, and both.fasta with 1816 protein chains. We have also two files obtained from the PDB_TM site, pdbtm_all.seq with 1363 chains.
- TMHMMset160: The dataset used to train TMHMM; it comprises 160 protein sequences in fasta format. They are all TM proteins and are prefixed with “sp|”.
- TMPDB: Database of 302 verified transmembrane protein sequences, together with their TM domain location and number, in SwissProt format. After translation into fasta for all the TM categories, alpha helix non-redundant (231), alpha buried non-redundant (7), and beta non-redundant (15), the three files are combined into tmpdb253.fasta, which contains 253 protein sequences in fasta format (“sp|”).
- TMbase: Database of transmembrane proteins and their helical membrane-spanning domains. It is mainly based on Swiss-Prot.
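The “sp|” prefix convention described above can be used to recover TM/non-TM labels directly from a fasta file. The following is a minimal sketch, not the project's own script: the helper names `read_fasta` and `is_tm` are hypothetical, the “nontm|” prefix is taken from the HMMer output shown later in the slides, and the residue strings are toy placeholders rather than real sequences.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of fasta lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_tm(header):
    """Assumption: entries marked as TM carry the 'sp|' prefix."""
    return header.startswith("sp|")


# Toy records; the residue strings are placeholders, not real sequences.
example = [
    ">sp|1xqe_A",
    "APAVADKADN",
    "AFMMICTALV",
    ">nontm|1ALO._ OXIDOREDUCTASE",
    "MKTAYIAKQR",
]
records = list(read_fasta(example))
labels = [is_tm(header) for header, _ in records]
```

Joining the wrapped sequence lines before labelling keeps the sketch usable on multi-line fasta entries.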

Predictors: ePST, bPST, TMHMM, TMpred, HMMTOP, HMMer, TMDET

Predictors: Major/Minor Contribution and Impact
- TMHMM: Predictor for TM helices. Based on TMHMM predictions, the authors estimated that 20-30% of all genes in most genomes encode membrane proteins. In July 2001 it was rated best for prediction of TM helices. The accuracy reported is 97-98%.
- TMpred: Predictor for membrane-spanning regions and their orientation. The underlying algorithm is based on the statistical analysis of TMbase. The prediction is made using a combination of several weight matrices for scoring. Still a reference comparison for TM protein prediction.
- HMMer: Searches for homologues of a sequence family. Builds an HMM from the training data and matches the query sequence against a sequence database to find homologues. The model takes as input a file on which multiple sequence alignment (MSA) has been performed. Improves upon the methods for sensitive database searches using multiple sequence alignments as queries.
- HMMTOP: Builds on an HMM architecture. The training model is a regularizer that is estimated from a set of known TM proteins. The prediction model is estimated from the query sequence and is then used to predict the structure of that sequence. The server accepts only one test sequence at a time. The accuracy reported is 78%.
- TMDET: Predictor for transmembrane domains. Based only on the structural (3D) information of the protein. Determines the membrane planes relative to the position of atomic coordinates. A discrimination function separates TM and globular proteins even in cases of low resolution or incomplete structures, such as fragments or parts of large multi-chain complexes. First algorithm that uses the 3D structure as input, identifies TM proteins, and determines membrane location. The method can be used to annotate protein structures that have TM segments. Generates PDB_TM, an automatically updated database of TM proteins from PDB. The algorithm can also construct a globular protein database.
- bPST: Histories are represented in the tree. Alternative approach for detecting significant patterns in protein sequences, based on probabilistic suffix trees (PSTs), that needs no prior information about the input sequences and no prior alignment of them. The PST model detects many more related sequences than pairwise methods, is much faster, and is almost as sensitive as an HMM.
- ePST: Training sequences are represented in the tree. The probability of a protein sequence under the model can be predicted in linear time using an efficient PST. Good results for protein function prediction.
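To make the PST idea concrete, the sketch below scores a sequence with a variable-order context model: each residue is predicted from the longest history suffix seen in training. This is only an illustration under stated assumptions, not the bPST or ePST implementation; it stores contexts in a dictionary rather than a tree, and `max_order`, `alphabet_size`, and `pseudo` are hypothetical parameters chosen for the example.

```python
import math
from collections import defaultdict


def train_contexts(sequences, max_order=2):
    """Count next-residue occurrences after every context of length 0..max_order."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, symbol in enumerate(seq):
            for k in range(min(max_order, i) + 1):
                counts[seq[i - k:i]][symbol] += 1
    return counts


def log_prob(seq, counts, max_order=2, alphabet_size=20, pseudo=1.0):
    """Sum log-probabilities, backing off to the longest context seen in training."""
    total = 0.0
    for i, symbol in enumerate(seq):
        ctx = ""
        for k in range(min(max_order, i), -1, -1):
            if seq[i - k:i] in counts:
                ctx = seq[i - k:i]
                break
        seen = counts.get(ctx, {})
        n = sum(seen.values())
        # Additive smoothing so unseen residues keep a nonzero probability.
        total += math.log((seen.get(symbol, 0) + pseudo) / (n + pseudo * alphabet_size))
    return total


# Toy training data; real training sets would be TM segments.
counts = train_contexts(["LLLL", "LLLL"])
score_similar = log_prob("LLL", counts)
score_different = log_prob("DKD", counts)
```

Because the context length is bounded by `max_order`, scoring takes a bounded amount of work per residue, which is the sense in which prediction is linear in the sequence length.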

Predictors Performance: Theoretical Time

TMHMM: Short-form prediction
sp_1xqe_A len=418 ExpAA=243.54 First60=39.67 PredHel=11 Topology=o10-32i45-67o98-120i127-149o159-181i193-215o225-247i259-281o285-302i315-337o352-374i
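The Topology field lists each predicted helix as a start-end residue range, separated by "i" (inside) and "o" (outside) markers. A minimal sketch of how such a string could be parsed follows; the function name `parse_topology` is hypothetical and is not part of TMHMM.

```python
import re


def parse_topology(topology):
    """Return (start, end) residue pairs for each helix in a TMHMM topology string."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)-(\d+)", topology)]


topology = (
    "o10-32i45-67o98-120i127-149o159-181i193-215"
    "o225-247i259-281o285-302i315-337o352-374i"
)
helices = parse_topology(topology)
n_terminus_side = "outside" if topology[0] == "o" else "inside"
```

For the prediction above, the number of parsed ranges agrees with PredHel=11.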

TMpred

HMMTOP

HMMer Flow

HMMer
Scores for complete sequences (score includes all domains):
Sequence      Description           Score  E-value  N
--------      -----------           -----  -------  ---
nontm|1ALO._  OXIDOREDUCTASE        -20.6  4.7      1
nontm|1CDE._  TRANSFERASE(FORMYL)   -26.1  9.9      1
nontm|1AKO._  NUCLEASE              -27.4  10       1
nontm|1ARU._  PEROXIDASE            -37.1  10       1
sp|1pv7_A                           -41.7  10       1
sp|1pw4_A                           -46.0  10       1
sp|1pxs_A                           -48.9  10       1
sp|1xqe_A                           -49.0  10       1
sp|1r2c_L                           -53.2  10       1
nontm|1HSB.B  HISTOCOMPATIBILITY    -61.4  10       1

Parsed for domains:
Sequence      Domain  seq-f  seq-t    hmm-f  hmm-t    score  E-value
--------      ------- -----  -----    -----  -----    -----  -------
nontm|1ALO._  1/1     125    323 ..   1      199 []   -20.6  4.7
nontm|1CDE._  1/1     4      202 ..   1      199 []   -26.1  9.9
nontm|1AKO._  1/1     5      202 ..   1      199 []   -27.4  10
nontm|1ARU._  1/1     112    295 ..   1      199 []   -37.1  10
sp|1pv7_A     1/1     116    314 ..   1      199 []   -41.7  10
sp|1pw4_A     1/1     162    329 ..   1      199 []   -46.0  10
sp|1pxs_A     1/1     51     249 .]   1      199 []   -48.9  10
sp|1xqe_A     1/1     39     226 ..   1      199 []   -49.0  10
sp|1r2c_L     1/1     62     260 ..   1      199 []   -53.2  10
nontm|1HSB.B  1/1     2      99 .]    1      199 []   -61.4  10

HMMer
Total sequences searched: 10
Whole sequence top hits:
tophits_s report:
  Total hits: 10
  Satisfying E cutoff: 9
  Total memory: 16K
Domain top hits:
tophits_s report:
  Total hits: 10
  Satisfying E cutoff: 10
  Total memory: 22K

ePST Output
TM#  Start  End
1    12     24
2    50     61
3    101    112
4    130    142
5    163    166
6    168    175
7    199    201
8    203    211
9    228    240
10   260    271
11   287    297
12   315    333
13   353    365
Total # ePST segments = 13

ePST Output
s#  i   char  pos    neg      odds    tot       win       maxwin    region
s   0   A     -1.87  -708.40  706.52  706.52    706.52    0.00      -
s   1   P     -2.96  -708.40  705.44  1411.96   1411.96   0.00      -
s   2   A     -1.87  -708.40  706.52  2118.48   2118.48   0.00      -
s   3   V     -0.75  -708.40  707.64  2826.13   2826.13   0.00      -
s   4   A     -1.80  -708.40  706.60  3532.72   3532.72   0.00      -
s   5   D     -6.47  -708.40  701.92  4234.65   4234.65   0.00      -
s   6   K     -3.53  -708.40  704.87  4939.52   4939.52   0.00      -
s   7   A     -3.40  -708.40  705.00  5644.51   5644.51   0.00      -
s   8   D     -6.47  -708.40  701.92  6346.43   6346.43   0.00      -
s   9   N     -5.22  -708.40  703.18  7049.61   7049.61   0.00      -
s   10  A     -1.87  -708.40  706.52  7756.14   7756.14   0.00      -
s   11  F     -3.91  -708.40  704.49  8460.63   8460.63   0.00      -
s   12  M     -3.76  -708.40  704.63  9165.26   9165.26   0.00      -
s   13  M     -3.76  -708.40  704.63  9869.89   9869.89   0.00      -
s   14  I     -2.06  -708.40  706.34  10576.23  10576.23  0.00      -
s   15  C     -4.54  -708.40  703.86  11280.08  10573.56  10573.56  -
s   16  T     -2.71  -708.40  705.69  11985.77  10573.81  10573.81  -
s   17  A     -2.48  -708.40  705.91  12691.68  10573.20  10573.81  -
s   18  L     -4.01  -708.40  704.38  13396.07  10569.94  10573.81  -
s   19  V     -1.29  -708.40  707.11  14103.18  10570.45  10573.81  -
s   20  L     -0.59  -708.40  707.81  14810.99  10576.34  10576.34  -
s   21  F     -1.12  -708.40  707.28  15518.26  10578.75  10578.75  +
s   22  M     -3.76  -708.40  704.63  16222.90  10578.39  10578.75  +
s   23  T     -3.12  -708.40  705.27  16928.17  10581.74  10581.74  +
s   24  I     -0.87  -708.40  707.52  17635.69  10586.08  10586.08  +
s   25  P     -0.51  -708.40  707.89  18343.58  10587.44  10587.44  +
s   26  G     -2.25  -708.40  706.15  19049.73  10589.11  10589.11  +
s   27  I     -1.49  -708.40  706.91  19756.64  10591.38  10591.38  +
s   28  A     -1.54  -708.40  706.85  20463.50  10593.61  10593.61  +
s   29  L     -4.01  -708.40  704.38  21167.88  10591.65  10593.61  +
s   30  F     -1.92  -708.40  706.48  21874.36  10594.27  10594.27  +
s   31  Y     -6.07  -708.40  702.33  22576.69  10590.91  10594.27  +
s   32  G     -2.25  -708.40  706.15  23282.84  10591.15  10594.27  +
s   33  G     -4.38  -708.40  704.02  23986.86  10590.79  10594.27  +
s   34  L     -1.54  -708.40  706.85  24693.71  10590.53  10594.27  +
s   35  I     -2.06  -708.40  706.34  25400.05  10589.06  10594.27  +
s   36  R     -2.75  -708.40  705.65  26105.70  10587.43  10594.27  +
s   37  G     -2.25  -708.40  706.15  26811.85  10588.95  10594.27  +
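The numbers in this output are consistent with tot being the running sum of the per-residue odds and win being the sum over the most recent 15 residues; for example, at i = 15, 11280.08 - 706.52 = 10573.56. The sketch below reproduces that arithmetic. It is inferred from the printed values only, so the function name `window_scores` and the default window of 15 are assumptions, and the ePST code itself may compute these columns differently.

```python
from itertools import accumulate


def window_scores(odds, window=15):
    """Return (tot, win): cumulative sums and sums over the last `window` values."""
    tot = list(accumulate(odds))
    win = []
    for i, running in enumerate(tot):
        # Before the window is full, the window sum equals the running total.
        win.append(running if i < window else running - tot[i - window])
    return tot, win


# Small integer example so the arithmetic is easy to follow by hand.
tot, win = window_scores([1, 2, 3, 4, 5], window=3)
```

Using cumulative sums means each window score costs one subtraction, regardless of the window width.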

ePST Execution Flow
Training Set and Testing Set -> ePST -> Prediction -> Post-processing Scripts -> TM segment table:
TM#  Start  End
1    12     24
2    50     61
3    101    112
4    130    142
5    163    166
6    168    175
7    199    201
8    203    211
9    228    240
10   260    271
11   287    297
12   315    333
13   353    365
Total # segments predicted by ePST = 13
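One plausible post-processing step is to collapse per-residue region flags ("+" or "-") into a Start/End segment table. The sketch below shows that grouping; it is an assumption about what the post-processing scripts do, not their actual code, and `flags_to_segments` is a hypothetical name. Indices are zero-based and inclusive.

```python
def flags_to_segments(flags):
    """Group consecutive '+' positions into inclusive (start, end) index pairs."""
    segments, start = [], None
    for i, flag in enumerate(flags):
        if flag == "+" and start is None:
            start = i
        elif flag != "+" and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        # The sequence ended while still inside a '+' run.
        segments.append((start, len(flags) - 1))
    return segments


segments = flags_to_segments("--+++--++")
total_segments = len(segments)
```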

HMMer Results for both.fasta
Step          Time
CLUSTALW      41.37s (41.05s)
hmmbuild      0.56s (0.25s -f)
hmmcalibrate  5.47s (2.71s -f)
hmmsearch     1.56s (0.73s -f)

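HMMer vs. ePST
Columns: Predictor; Train (Global, Local); Test (Global, Local); Accuracy (Global, Local)
- HMMer: Train 6.03 (2.96); Test 1.56 (0.73); Accuracy 240/916 = 26%
- ePST: Train 0.33 (global), 0.23 (local); Test 5.59 with fp=fn=288 (global), 0.82 with fp=fn=333 (local); Accuracy 69% (global), 64% (local)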

ePST
Training    Testing     Local Accuracy           Global Accuracy
DBTMR40672  q.fasta     100% (W 15)              100%
DBTMR1000   q.fasta     60% (W 15), fp=2; fn=0   100%
DBTMR40672  both.fasta  fp=fn=269, 71%           fp=fn=247, 73%
DBTMR1000   both.fasta  fp=fn=333, 64%           fp=fn=288, 69%
Set 160     both.fasta  fp=fn=321, 65%           72.55% (-), 73% (+)

Cross-validation (5 folds) - ePST
Columns: Data Set; Train Global; Train Local; Test Global; Test Local; Accuracy Global; Accuracy Local
- DBTMR40672: 7.15; 7.20; 0.36; 0.42; 100%
- DBTMR1000: 0.13; 0.01; 100%
- tm.fasta: 0.60; 0.01; 100%
- both.fasta: 0.61; 1.67; 99%
- Set 160: 0.84; 0.00; 100%
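For reference, a 5-fold split partitions the data into five disjoint folds and uses each fold once as the test set while training on the other four. The sketch below is a generic illustration; how the folds were actually assigned in this project is not stated, so the round-robin assignment and the name `k_fold_splits` are assumptions.

```python
def k_fold_splits(items, k=5):
    """Yield (train, test) lists, with each of the k folds used once as the test set."""
    folds = [items[i::k] for i in range(k)]  # round-robin assignment to folds
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# Ten placeholder sequence identifiers stand in for a real data set.
splits = list(k_fold_splits(list(range(10)), k=5))
```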

TMHMM and ePST
Columns: Predictor; Testing; Local Accuracy; Global Accuracy
- TMHMM, tested on both.fasta: 99.11% (-), 60% (+)
- ePST trained on Set 160, tested on both.fasta: 65% local; 72.55% (-), 73% (+) global
- ePST trained on mix.fasta, tested on both.fasta: 74% (W 15, 20), 78% (W 10, 35), 80% (W 25, 27) local; 78% global
- ePST trained on Set 160, tested on q.fasta: 100%
- TMHMM, tested on q.fasta: 100%

Scanning PDB
- Training: DBTMR40672
- Testing: PDB
- Threshold 705.37 -> Nrtm = 1665 chains
- PDB_TM retrieves 1673 chains
- Validation is necessary because there is no ground truth

TMH Benchmark
- tmeval.fasta: 2247 non-annotated sequences
- Script for converting ePST output to the TMH submit format
- Comparison with other predictors
- 4 tables
- 8 evaluation parameters

Window 25, 35, T 10584 - High Resolution

Window 25, 35, T 10584 - Low Resolution

Window 15, T 10588 – High Resolution

Window 15, T 10588 – Low Resolution

Window 15, T 10588 – False Positives

Window 15, T 10588 – Confusion with Signal Peptides

Conclusions
- ePST is a competitive predictor
- Fast training
- Scales well, in contrast with HMMs
- Unlike HMMs, ePST does not suffer from poor local minima
- ePST does not require an MSA of the sequences
- ePST allows more than one test sequence at a time

Future Work
- More tuning, use pruning
- Applications to other tasks (phosphorylation) involved in signal transduction pathways
- Search for a verified data set for training and testing (no consensus in the literature)
- Extract features from the sequence
- Analyze the false negatives with particular helix topologies (such as 1orq)
