Truncation of Protein Sequences for Fast Profile Alignment with Application to Subcellular Localization Man-Wai MAK and Wei WANG The Hong Kong Polytechnic.


1 Truncation of Protein Sequences for Fast Profile Alignment with Application to Subcellular Localization. Man-Wai MAK and Wei WANG (The Hong Kong Polytechnic University); Sun-Yuan KUNG (Princeton University)

2 Contents
1. Introduction
– Cell Organelles and Protein Subcellular Localization
– Signal-Based vs. Homology-Based Methods
2. Speeding Up the Prediction Process
– Predicting Cleavage Site Location
– Truncating Profiles vs. Truncating Sequences
– Perturbational Discriminant Analysis
3. Experiments and Results
4. Conclusions

3 Organelles: Cells have a set of organelles, each specialized for carrying out one or more vital functions. Proteins must be transported to the correct organelles of a cell to perform their functions properly. Therefore, knowing a protein's subcellular localization is one step towards understanding its function.

4 Proteins and Their Subcellular Location

5 Subcellular Localization Prediction: There are two key methods: (1) signal-based and (2) homology-based.

6 Signal-Based Method: The amino acid sequence of a protein contains information about its organelle destination. Typically, the information can be found within a short segment of 20 to 100 amino acids preceding the cleavage site. Signal-based methods (e.g. TargetP) can determine the cleavage site location. [Figure: protein sequence with the cleavage site marked. Source: S. R. Goodman, Medical Cell Biology, Elsevier, 2008.]

7 Homology-Based Method: The full-length query sequence is aligned with each of the N full-length training sequences S(1) = KNKA···, S(2) = KAKN···, ..., S(N) = KGLL···. The resulting N-dim alignment vector is fed to an SVM classifier, which outputs the subcellular location. Advantage: It can predict sequences that do not have cleavage sites. Drawback: Given a query sequence, we need to align it with every sequence in the training set, which leads to long computation times.
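The alignment-vector construction on this slide can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `sw_score` is a toy Smith-Waterman local alignment with assumed match, mismatch, and gap scores (a real system would use a substitution matrix or profile alignment), the toy sequences are hypothetical, and the SVM stage is omitted.

```python
def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Toy Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def alignment_vector(query, training_seqs):
    """Align the query with every training sequence: one score per sequence."""
    return [sw_score(query, s) for s in training_seqs]


# Hypothetical toy sequences (prefixes borrowed from S(1), S(2), S(N) on the slide).
training = ["KNKAAL", "KAKNLL", "KGLLAV"]
vec = alignment_vector("KNKAAL", training)  # N-dim vector, here N = 3
```

The cost grows with both the number of training sequences and the product of the sequence lengths, which is why shortening the sequences reduces alignment time.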

8 Sequence Length Distribution: Many sequences are fairly long, so aligning whole sequences takes a long computation time. The signal segments (cTP, mTP, and SP) are under 100 amino acids (AAs) long yet carry the most relevant information. Computation can therefore be saved by aligning the signal segments only. [Figure: distribution of sequence lengths (x-axis: sequence length; y-axis: occurrences), with a schematic of the SP, mTP, and cTP segments preceding the cleavage site in Ext, Mit, and Chl sequences.]

9 Proposed Method: Aligning the Segments that Contain the Most Relevant Information. Amino acid sequence (N-terminus to C-terminus) → signal-based cleavage site predictor (e.g. TargetP) → truncate at the cleavage site → truncated sequence → homology-based method → subcellular location.

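10 Aligning Profiles vs. Aligning Sequences: Scheme I: truncate the profiles (build the profile from the full-length query sequence, then cut it). Scheme II: truncate the sequences (cut the query sequence first, then build the profile).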

11 Perturbational Discriminant Analysis: [Figure and equations: the input space, the Hilbert space, and the empirical space.]

12 Perturbational Discriminant Analysis: The objective of PDA is to find an optimal discriminant function in the Hilbert space or the empirical space [equation missing from the transcript]. The optimal solution in the empirical space (see the derivation in the paper) is [equation missing from the transcript]. Here ρ represents the noise (uncertainty) level in the measurement; it also ensures numerical stability of the matrix inverse. ρ = 1 in this work.
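The role of ρ can be illustrated with a small numerical sketch. The slide's own formula is not available here, so the code assumes a regularized least-squares form that is often used for kernel discriminants with a bias term: a = (K + ρI)^(-1) (y − b·e) with b = (y^T (K + ρI)^(-1) e) / (e^T (K + ρI)^(-1) e), where e is the all-ones vector. This form, the function names `pda_train` and `pda_score`, and the toy data are assumptions for illustration; the paper should be consulted for the authors' exact solution.

```python
import numpy as np


def pda_train(K, y, rho=1.0):
    """Solve (K + rho*I) a + b*e = y subject to e^T a = 0 (assumed form)."""
    n = K.shape[0]
    e = np.ones(n)
    M = K + rho * np.eye(n)          # rho > 0 keeps M well conditioned
    Minv_y = np.linalg.solve(M, y)
    Minv_e = np.linalg.solve(M, e)
    b = (e @ Minv_y) / (e @ Minv_e)
    a = Minv_y - b * Minv_e
    return a, b


def pda_score(k_x, a, b):
    """Score f(x) = a^T k(x) + b, where k(x)[i] = K(x_i, x)."""
    return float(a @ k_x + b)


# Hypothetical toy data: linear kernel on four 2-D points labeled +1 / -1.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K = X @ X.T
a, b = pda_train(K, y, rho=1.0)
scores = K @ a + b                   # scores on the training points
```

Without the ρI term, the rank-deficient K in this toy example could not be inverted, which is the numerical-stability point made on the slide.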

13 Perturbational Discriminant Analysis: Example on 2-D Data. [Figure panels: 3 classes of 2-dim data in the input space; RBF kernel matrix K; projection onto the 2-dim PDA space; decision boundaries in the input space.]
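For readers who want to reproduce the kernel-matrix panel, an RBF kernel matrix can be computed roughly as below. The kernel width `sigma` and the six toy points are assumptions; the slide does not state the values used.

```python
import numpy as np


def rbf_kernel_matrix(X, sigma=1.0):
    """K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


# Hypothetical toy data: three classes of 2-D points, two points per class.
X = np.array([[0.0, 0.0], [0.1, 0.2],
              [3.0, 3.0], [3.2, 2.9],
              [0.0, 4.0], [0.2, 4.1]])
K = rbf_kernel_matrix(X, sigma=1.0)
```

Points from the same class are close together, so their kernel values are near 1, which produces the block structure typically seen in such a matrix.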

14 Perturbational Discriminant Analysis: Application to Sequence Classification. Training: training sequences → PSI-BLAST → training profiles → pairwise alignment → kernel matrix K → compute PDA parameters. Testing: test sequence → PSI-BLAST → test profile → align with training profiles → compute PDA score.

15 Perturbational Discriminant Analysis: Application to Multi-Class Problems. 1-vs-Rest PDA Classifier: the outputs of the per-class PDA classifiers are passed to a MAXNET.
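A MAXNET stage simply picks the class whose one-vs-rest score is largest. The sketch below assumes hypothetical scores and an assumed class ordering; it is not taken from the slide.

```python
def maxnet(scores):
    """Return the index of the class whose one-vs-rest score is largest."""
    return max(range(len(scores)), key=lambda c: scores[c])


# Hypothetical one-vs-rest PDA scores for four classes, in this assumed order.
classes = ["Ext", "Mit", "Chl", "Cyt/Nuc"]
scores = [-0.4, 0.7, 0.1, -0.9]
predicted = classes[maxnet(scores)]
```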

16 Perturbational Discriminant Analysis: Application to Multi-Class Problems. Cascaded PDA-SVM Classifier: test sequence → project onto the (C–1)-dim PDA space → 1-vs-rest SVM classifier → class label.

17 Experiments. Materials: Eukaryotic sequences extracted from Swiss-Prot 57.5; the Ext, Mit, and Chl sequences contain experimentally determined cleavage sites; redundancy reduced to 25% sequence identity (based on BLASTclust). Performance evaluation: 5-fold cross-validation; prediction accuracy and Matthews correlation coefficient (MCC).
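For reference, accuracy and MCC can be computed from the confusion counts of one class against the rest, as in the sketch below. The counts are hypothetical and are not results from the experiments.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the table is empty
    return (tp * tn - fp * fn) / denom


def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified samples."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts for one class treated as "positive" versus the rest.
tp, tn, fp, fn = 40, 45, 5, 10
acc_value = accuracy(tp, tn, fp, fn)
mcc_value = mcc(tp, tn, fp, fn)
```

MCC ranges from -1 to +1 and, unlike accuracy, is not inflated when one class is much larger than the others.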

18 Comparing Kernel Matrices: [Figure: kernel matrices obtained for the query sequences under Scheme I and Scheme II.]

19 Sensitivity Analysis: The localization performance degrades when the cut-off position drifts away from the ground-truth cleavage site. mTP and cTP are more sensitive to cleavage-site prediction errors than Ext. [Figure: subcellular localization accuracy (%) of PairProSVM versus cut-off position (p−16, p−8, p−2, p, p+2, p+16, p+32, p+64), where sequences are cut at p ± x and p is the ground-truth cleavage site; curves for Cyt/Nuc, Mit, Chl, Ext, and overall.]

20 Performance of Cleavage Site Prediction: For signal peptides (Ext), the conditional random field (CRF) predicts cleavage sites better than TargetP(Plant) but worse than TargetP(NonPlant). CRF is slightly inferior to TargetP in predicting the cleavage sites of mitochondrial sequences, but it is significantly better than TargetP in predicting those of chloroplast sequences. [Figure: cleavage-site prediction accuracy (%) by category for TargetP(Plant), TargetP(NonPlant), and CRF.]

21 Comparing Profile Creation Time. Scheme I: query sequence → PSI-BLAST → long profile → cut → short profile → pairwise alignment → score vector → SVM or KPDA → subcellular location. Scheme II: query sequence → cut → short sequence → PSI-BLAST → short profile → pairwise alignment → score vector → SVM or KPDA → subcellular location. Findings: Profile creation time can be substantially reduced by truncating the protein sequences at the cleavage sites.

22 Training and Classification Time. Findings: The training times of the 1-vs-rest PDA and the cascaded PDA-SVM are substantially shorter than that of the SVM. [Figure: training and classification times; the cascaded PDA-SVM projects onto the (C–1)-dim PDA space and then applies a 1-vs-rest SVM classifier.]

23 Compare with State-of-the-Art Localization Predictors. Findings: In terms of localization accuracy, the proposed “Signal+Homology” method performs slightly better than the signal-based TargetP and is substantially better than the homology-based SubLoc. [Figure: localization accuracy (%) and MCC of the compared predictors, including the conditional random fields variant.]

24 Conclusion: Fast subcellular localization prediction can be achieved by a cascaded fusion of signal-based and homology-based methods. As far as localization accuracy is concerned, it does not matter whether we truncate the sequences or truncate the profiles. However, truncating the sequences reduces the profile creation time by a factor of six.

25 Compare with State-of-the-Art Localization Predictors

26 Performance of Cascaded Fusion: The computation time for full-length profile alignment is a striking 116 hours. Our method not only reduces the computation time nearly 20-fold but also boosts the prediction performance. [Figure: computation time (hr.) and subcellular localization accuracy (%) for full-length sequences and for sequences with the cleavage site predicted by TargetP(P), TargetP(N), and CRF.]

27 Fusion of Signal- and Homology-Based Methods
1) Cleavage site detection. The cleavage site (if any) of a query sequence is determined by a signal-based method.
2) Pre-sequence selection. The pre-sequence of the query is obtained by selecting the segment from the N-terminus up to the cleavage site.
3) Pairwise alignment. The pre-sequence is aligned with each of the training pre-sequences to form an N-dim vector, which is fed to a one-vs-rest SVM classifier for prediction.
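The three steps above can be sketched end to end as follows. Everything named here is hypothetical: `toy_csite_predictor` stands in for a real signal-based predictor such as TargetP, `shared_prefix_score` is a placeholder for a real pairwise aligner, and keeping the full sequence when no cleavage site is found is an assumption, since the slide does not say how that case is handled. The SVM stage is omitted.

```python
def truncate_at_cleavage_site(seq, csite):
    """Step 2: keep the N-terminal pre-sequence up to the cleavage site.

    Assumption: when no site is found, the full sequence is kept.
    """
    if csite is None or csite <= 0:
        return seq
    return seq[:csite]


def score_vector(query_pre, training_pres, align):
    """Step 3: align the query pre-sequence with each training pre-sequence."""
    return [align(query_pre, t) for t in training_pres]


def shared_prefix_score(a, b):
    """Placeholder similarity (length of the common prefix); NOT a real aligner."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


def toy_csite_predictor(seq):
    """Hypothetical stand-in for step 1: pretend the site follows the first 'A'."""
    idx = seq.find("A")
    return idx + 1 if idx >= 0 else None


query = "MKLLSAGTPQRW"
csite = toy_csite_predictor(query)                       # step 1
pre = truncate_at_cleavage_site(query, csite)            # step 2
vec = score_vector(pre, ["MKLLSA", "MKTTA", "QQA"], shared_prefix_score)  # step 3
```

Because only the short pre-sequences are aligned, the cost of step 3 no longer depends on the full sequence lengths.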

28 Perturbational Discriminant Analysis: Spectral Space. The kernel matrix K can be factorized via spectral decomposition [equation missing from the transcript]. [Figure: the empirical space and the spectral space.]
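A spectral decomposition of a symmetric kernel matrix can be computed as in the sketch below. The factorization convention K = U^T Λ U (rows of U are eigenvectors) is an assumption, because the slide's formula is not available; the toy kernel matrix is also hypothetical.

```python
import numpy as np


def spectral_factorize(K):
    """Return (U, lam) with K = U.T @ diag(lam) @ U, eigenvalues in descending order."""
    lam, V = np.linalg.eigh(K)       # columns of V are eigenvectors, eigenvalues ascending
    order = np.argsort(lam)[::-1]
    lam, V = lam[order], V[:, order]
    return V.T, lam


# Hypothetical toy kernel matrix: linear kernel on three 2-D points.
X = np.array([[1.0, 0.0], [0.5, 1.0], [-1.0, 0.5]])
K = X @ X.T
U, lam = spectral_factorize(K)
K_rebuilt = U.T @ np.diag(lam) @ U
```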

