Identifying Extracellular Plant Proteins Based on Frequent Subsequences of Amino Acids Y. Wang, O. Zaiane, R. Goebel.


1 Identifying Extracellular Plant Proteins Based on Frequent Subsequences of Amino Acids Y. Wang, O. Zaiane, R. Goebel

2 Introduction: a protein is a linear sequence of amino acids; protein subcellular localization (plant: nuclear, cytoplasmic, mitochondrial, extracellular, …); intracellular vs. extracellular; sequence information alone; class imbalance; transparency

3 Related Work: N-terminal sorting signals; amino acid composition; lexical analysis; integrative approach; subsequence methods

4 Predicting Extracellular Proteins: feature extraction; support vector machine; boosting; frequent pattern method

5 Feature Extraction: frequent subsequences are subsequences that occur in more than a certain percentage of extracellular proteins; they have strong discriminative power; proteins that share them may perform similar functions via related biochemical mechanisms; they capture local similarity
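
The definition on this slide can be illustrated with a brute-force sketch. It is not the authors' implementation (the next slide names a generalized suffix tree, which would be far more efficient); it assumes "subsequence" means a contiguous run of amino acids, and the names `frequent_subsequences`, `min_sup`, `min_len`, and `max_len` are hypothetical. The `max_len` cap is an added assumption to keep the enumeration small.

```python
from collections import Counter

def frequent_subsequences(proteins, min_sup=0.05, min_len=3, max_len=6):
    """Return {subsequence: support} for contiguous subsequences whose
    support (fraction of proteins containing them) is at least min_sup."""
    counts = Counter()
    for seq in proteins:
        seen = set()
        for k in range(min_len, max_len + 1):
            for i in range(len(seq) - k + 1):
                seen.add(seq[i:i + k])
        # Count each subsequence at most once per protein.
        counts.update(seen)
    n = len(proteins)
    return {s: c / n for s, c in counts.items() if c / n >= min_sup}

# Toy sequences (made up for illustration, not real proteins).
toy = ["MKAAL", "AAALK", "GGAAL"]
result = frequent_subsequences(toy, min_sup=0.6, min_len=3, max_len=3)
print(result)
```

On the toy input only "AAL" occurs in at least 60% of the sequences, so it is the only feature returned.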

6 Generalized Suffix Tree

7 Support Vector Machine: input data are represented as feature vectors; find a linear separator that separates the data and maximizes the margin; a kernel function gives a nonlinear separator
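
For reference, the textbook hard-margin formulation behind "maximize the margin" can be written as below. The slide gives no formula, so the symbols are assumptions: training vectors x_i with labels y_i in {-1, +1}, weight vector w, bias b, dual coefficients alpha_i, and kernel K. The margin width is 2/||w||, so minimizing ||w|| maximizes it; replacing the inner product with K yields the nonlinear separator.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^{2}
\quad\text{subject to}\quad
y_i\,(w^{\top}x_i + b) \ge 1,\qquad i = 1,\dots,m

f(x) = \operatorname{sign}\Bigl(\sum_{i=1}^{m}\alpha_i\,y_i\,K(x_i, x) + b\Bigr)
```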

8 SVM for Extracellular Protein Prediction: data transformation (sequence → vector) uses frequent subsequences as features and transforms each protein sequence into a binary vector; kernel functions: linear kernel, polynomial kernel, RBF kernel
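
A minimal sketch of the sequence-to-vector step and the three kernels, assuming one binary component per frequent subsequence (1 if the protein contains it, 0 otherwise). The function names and the kernel parameters `degree`, `coef0`, and `gamma` are hypothetical; the slide does not give the values used.

```python
import math

def to_binary_vector(sequence, features):
    """1 where the frequent subsequence occurs in the protein, else 0."""
    return [1 if f in sequence else 0 for f in features]

def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, degree=2, coef0=1.0):
    return (linear_kernel(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=0.5):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Toy features and sequences (made up for illustration).
features = ["AAL", "GGA", "KLV"]
x = to_binary_vector("MKAALGGA", features)
y = to_binary_vector("GGAKLV", features)
print(x, y, linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y))
```

With binary vectors the linear kernel simply counts the frequent subsequences two proteins have in common.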

9 Boosting: iterative algorithms that improve a weak classifier; a different weighted distribution of examples is used in each iteration; the weights of incorrectly classified examples are increased and the weights of correctly classified ones are decreased
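
The reweighting described here can be sketched as one round of the standard AdaBoost update. This is the textbook rule, not necessarily the exact variant the authors ran; it assumes labels and predictions in {-1, +1} and weights that sum to 1, and `adaboost_round` is a hypothetical name.

```python
import math

def adaboost_round(weights, labels, predictions):
    """Return (alpha, new_weights) after one AdaBoost reweighting step."""
    # Weighted error of the weak classifier on the current distribution.
    err = sum(w for w, y, p in zip(weights, labels, predictions) if y != p)
    err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
    alpha = 0.5 * math.log((1 - err) / err)
    # y * p is +1 for a correct prediction (weight shrinks)
    # and -1 for a mistake (weight grows).
    raw = [w * math.exp(-alpha * y * p)
           for w, y, p in zip(weights, labels, predictions)]
    total = sum(raw)
    return alpha, [w / total for w in raw]

weights = [0.25, 0.25, 0.25, 0.25]
labels = [1, 1, -1, -1]
predictions = [1, 1, -1, 1]  # the last example is misclassified
alpha, new_weights = adaboost_round(weights, labels, predictions)
print(alpha, new_weights)
```

After the update the misclassified example carries half of the total weight, so the next weak classifier is pushed to get it right.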

10 AdaBoost

11 Frequent Pattern Method: a frequent pattern has the form *X1*X2*…*Xn* → extracellular, where X1, X2, …, Xn are frequent subsequences; each "*" can stand for zero up to MaxGap amino acids when matching a protein sequence
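
One way to check whether a protein matches such a pattern is to translate it into a regular expression. The sketch below reads the slide literally, so the leading and trailing "*" are also limited to MaxGap residues; that reading is an assumption, as are the names `compile_pattern` and `matches`.

```python
import re

def compile_pattern(subsequences, max_gap):
    """Build a regex for *X1*X2*...*Xn* where each * spans 0..max_gap residues."""
    gap = ".{0,%d}" % max_gap
    body = gap.join(re.escape(s) for s in subsequences)
    return re.compile(gap + body + gap)

def matches(pattern, protein):
    # fullmatch: the whole sequence must be covered by the pattern.
    return pattern.fullmatch(protein) is not None

# Toy pattern and sequences (made up for illustration).
pattern = compile_pattern(["AAL", "GGA"], max_gap=2)
print(matches(pattern, "MAALKGGAV"))    # gaps of 1, 1, 1
print(matches(pattern, "MAALKKKGGAV"))  # middle gap of 3 exceeds max_gap
```

A rule then predicts "extracellular" for every protein its pattern matches.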

12 FOIL Algorithm
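
The body of this slide is not preserved in the transcript. As background only, the gain measure used by the standard FOIL rule learner can be sketched as follows; how the authors adapt it to frequent patterns is not shown here, and treating the Min_gain parameter on the results slide as a threshold on this quantity is an assumption. `foil_gain` is a hypothetical name: `p0`, `n0` are the positive and negative examples covered by a rule before it is extended, and `p1`, `n1` those covered afterwards.

```python
import math

def foil_gain(p0, n0, p1, n1):
    """FOIL gain of extending a rule: p1 * (log2(p1/(p1+n1)) - log2(p0/(p0+n0)))."""
    if p1 == 0:
        return 0.0  # the extended rule covers no positives, so nothing is gained
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# Extending a rule so that it keeps 8 of 10 positives but only 2 of 10 negatives.
gain = foil_gain(p0=10, n0=10, p1=8, n1=2)
print(round(gain, 3))
```

The gain is positive when the extension raises the proportion of positives among the covered examples, scaled by how many positives remain covered.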

13 Z-number (formula not preserved): defined in terms of the accuracy of rule R and the support of rule R

15 Experiments: dataset from the PASub project at UofA; plant: 3293 proteins, 171 extracellular; five-fold cross-validation
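
A sketch of how five folds could be formed for a dataset this imbalanced. Stratifying by class, so that every fold receives a share of the 171 extracellular proteins, is an assumption; the slide does not say how the folds were drawn. `stratified_folds` is a hypothetical helper and the labels below are placeholders, not the actual data.

```python
import random

def stratified_folds(labels, k=5, seed=0):
    """Split example indices into k folds, spreading each class evenly."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin into the folds.
        for j, idx in enumerate(indices):
            folds[j % k].append(idx)
    return folds

# 1 = extracellular, 0 = other; class sizes taken from the slide.
labels = [1] * 171 + [0] * (3293 - 171)
folds = stratified_folds(labels)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print([len(f) for f in folds], positives_per_fold)
```

Each fold serves once as the test set while the other four are used for training.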

16 Evaluation Metrics: overall accuracy alone is not a good enough measure; F-measure
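
The point about accuracy can be made concrete with the class sizes from the Experiments slide: a classifier that labels every protein as non-extracellular is right most of the time yet finds no extracellular protein at all. The sketch below uses the standard definitions of precision, recall, and F-measure; the function name and the `beta` parameter are assumptions, since the slide does not say which F-measure variant was used.

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F-measure) from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

# Trivial classifier: predict "not extracellular" for all 3293 proteins.
total, positives = 3293, 171
accuracy = (total - positives) / total
_, _, f_trivial = precision_recall_f(tp=0, fp=0, fn=positives)
print(round(accuracy, 3), f_trivial)
```

The trivial classifier scores roughly 95% accuracy but an F-measure of 0, which is why F-measure is the more informative number for the minority class.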

17 Result (SVM with subsequence)

18 Result (Boosting with subsequence)

19 Result (Frequent Pattern): MinLen=3, Min_gain=0.1, MinSup=5%, MinConf=80%, MaxGap=300

20 Result (SVM with composition)

21 Result (Boosting with composition)

22 Cross Comparison

23 SVM with combined features

24 Boosting with combined features

25 Effects of MinLen on SVM

26 Effects of MinLen on boosting

27 Conclusion: presented three methods for identifying extracellular proteins based on frequent subsequences of amino acids; SVM achieves the best result; the FSP method provides easily interpretable rules

28 Future Work: use more information about proteins (e.g., structure, function, …); integrate amino acid composition into the FSP method; incorporate more biological knowledge

