Frequent-Subsequence-Based Prediction of Outer Membrane Proteins R. She, F. Chen, K. Wang, M. Ester, School of Computing Science J. L. Gardy, F. S. L.

Frequent-Subsequence-Based Prediction of Outer Membrane Proteins
R. She, F. Chen, K. Wang, M. Ester (School of Computing Science)
J. L. Gardy, F. S. L. Brinkman (Dept. of Mol. Biology & Biochemistry)
Simon Fraser University, BC, Canada

2 1. Problem Introduction
Gram-negative bacteria
- Medically important disease-causing bacteria
- 5 sub-cellular localizations (2 layers of membranes):
  1. Cytoplasmic
  2. Inner membrane
  3. Periplasmic
  4. Outer membrane
  5. Extracellular

3 Outer Membrane Proteins
Goal: predicting outer membrane proteins (OMPs) of Gram-negative bacteria
- OMPs are attached to the "outer membrane" of the Gram-negative bacterial cell
- Particularly useful as drug targets

4 Outer Membrane Protein (Cont.)
Structure:
- β-strands form a central barrel shape
- Inner turns: shorter stretches
- Outer loops: longer stretches
[Figure labels: outer membrane, extracellular side, periplasmic side, outer loop, inner turn, β-strand]

5 Challenges
- Identifying OMPs from sequence information alone
- Discriminative sequence patterns of OMPs would be helpful

6 Challenges (Cont.)
- Favor precision over recall: studying a targeted drug in the lab takes a long time and laborious effort, so false positives are costly
Confusion matrix:
                        Actual OMP   Actual non-OMP   Subtotal
Classified as OMP       TP           FP               A
Classified as non-OMP   FN           TN               B
Subtotal                C            D                E
Overall accuracy = (TP + TN) / E
Precision = TP / A
Recall = TP / C
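The three measures above can be computed directly from the four confusion-matrix cells. The following is a minimal sketch; the function name and the example counts are hypothetical and are not taken from the presentation.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return (overall accuracy, precision, recall) from confusion-matrix counts."""
    a = tp + fp              # A: everything classified as OMP
    c = tp + fn              # C: all actual OMPs
    e = tp + fp + fn + tn    # E: all sequences
    accuracy = (tp + tn) / e
    precision = tp / a if a else 0.0   # guard against an empty "classified as OMP" row
    recall = tp / c if c else 0.0      # guard against a dataset without OMPs
    return accuracy, precision, recall


# Hypothetical counts, chosen only to illustrate the formulas.
acc, prec, rec = confusion_metrics(tp=40, fp=10, fn=60, tn=390)
print(acc, prec, rec)
```

In this made-up case, accuracy and precision are both high while only 40 of the 100 actual OMPs are recovered, which is the kind of trade-off the slide describes.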

7 2. Dataset
OMP sequence dataset
- Excellent quality (http://www.psort.org/dataset)
- Protein sequences (strings over an alphabet of 20 letters), e.g. MNQIHK…
- Two classes with imbalanced distributions
Data      Number of sequences   Percentage of each class   Minimum length   Maximum length   Average length
OMP       427                   27.4%                      91               3705             571.1
Non-OMP   1132                  72.6%                      50               1034             256.8
Total     1559                  -                          -                -                342.9

8 Evaluation
- The majority of the data is non-OMP, so overall accuracy is determined mainly by non-OMP prediction
- Precision is our main concern (≥ 90%)
- Recall should be maintained at a reasonable level (≥ 50%)
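To see why overall accuracy is a weak measure on this dataset, consider a trivial classifier that labels every sequence as non-OMP. The short sketch below uses the class sizes from the dataset slide (427 OMP, 1132 non-OMP); the trivial classifier itself is an illustration and is not one of the methods in the presentation.

```python
n_omp, n_non_omp = 427, 1132   # class sizes from the dataset slide
total = n_omp + n_non_omp

# Trivial baseline: classify every sequence as non-OMP.
tp, fp, fn, tn = 0, 0, n_omp, n_non_omp

baseline_accuracy = (tp + tn) / total
baseline_recall = tp / (tp + fn)

print(f"accuracy={baseline_accuracy:.3f}, recall={baseline_recall:.1f}")
# prints "accuracy=0.726, recall=0.0"
```

The baseline reaches about 72.6% accuracy without finding a single OMP, which is why precision and recall on the OMP class are the measures of interest here.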

9 3. Related Work
Existing sub-cellular localization predictors
- Inner membrane proteins have α-helix structures; prediction is highly accurate
- Prediction of cytoplasmic, periplasmic and extracellular proteins: neural networks, covariant discriminant algorithm, Markov chain models, support vector machines (highest accuracy: 91%)
- These do not apply to OMPs

10 Existing Work on OMP Prediction
- Neural networks, hydrophobicity analysis, combination of methods (homology analysis, amino acid abundance)
- Current state of the art: Hidden Markov Models by Martelli et al. [1]
  - Use an HMM to model OMPs according to their 3D structures
  - Training set is small (12 proteins with known 3D structures)
  - Overall accuracy: 89%; recall: 84%; precision: 46%

11 4. Algorithms
Motivations
- Frequent subsequence mining is helpful
  - Frequent subsequence: consecutive amino acids that occur frequently in OMPs
- OMP characteristics
  - Common structure in OMPs
  - Different regions have different characteristic sequence residues
  - Model local similarities by frequent subsequences and highly variable regions by wildcards (*X*X*…)
=> Association-rule-based classification
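One way to read a pattern of the form *X*X*… is as a list of subsequences that must occur in order, separated by arbitrary gaps. The sketch below checks a sequence against such a pattern; the function name, the non-overlap requirement, and the toy sequence are assumptions made for illustration, not details given in the presentation.

```python
def matches_pattern(sequence, segments):
    """Return True if the segments occur in `sequence` in the given order,
    without overlapping, with arbitrary gaps (the '*' wildcards) between them."""
    pos = 0
    for seg in segments:
        idx = sequence.find(seg, pos)   # leftmost occurrence at or after pos
        if idx == -1:
            return False
        pos = idx + len(seg)            # the next segment must start after this one
    return True


# Toy sequence and segments, made up for the example.
print(matches_pattern("MNQIHKAALV", ["NQI", "AAL"]))   # segments appear in this order
print(matches_pattern("MNQIHKAALV", ["AAL", "NQI"]))   # same segments, wrong order
```

Taking the leftmost occurrence of each segment is sufficient here: if any valid placement exists, the greedy one leaves at least as much room for the remaining segments.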

12 Algorithm 1: Rule-Based Classification
- Mine frequent subsequences X (consecutive amino acids) only from the OMP class (support(X) ≥ MinSup)
- Remove trivial similarities by restricting the minimum length (MinLgh) of frequent subsequences
- Find frequent patterns (*X*X*…)
- Build the classifier using frequent pattern rules (*X*X*… → OMP)
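The first two steps can be sketched as a brute-force enumeration of contiguous substrings. This is an illustrative sketch only, not the authors' mining algorithm: it assumes support means the fraction of OMP sequences that contain the substring, and it adds a hypothetical `max_len` bound to keep the enumeration small. The toy sequences are made up.

```python
from collections import Counter


def mine_frequent_subsequences(omp_sequences, min_sup, min_len, max_len):
    """Return {substring: support} for contiguous substrings of length
    min_len..max_len whose support is at least min_sup.

    Support is taken here as the fraction of OMP sequences containing
    the substring (an assumption; the slides do not define it)."""
    counts = Counter()
    for seq in omp_sequences:
        seen = set()
        for length in range(min_len, max_len + 1):
            for start in range(len(seq) - length + 1):
                seen.add(seq[start:start + length])
        counts.update(seen)   # each sequence contributes at most once per substring
    n = len(omp_sequences)
    return {s: c / n for s, c in counts.items() if c / n >= min_sup}


# Toy "OMP" sequences, made up for the example.
toy_omps = ["AVLGAV", "GAVLK", "TTAVL"]
strict = mine_frequent_subsequences(toy_omps, min_sup=1.0, min_len=3, max_len=4)
relaxed = mine_frequent_subsequences(toy_omps, min_sup=0.6, min_len=3, max_len=4)
print(strict)
print(sorted(relaxed))
```

Here `min_len` plays the role of MinLgh and `min_sup` the role of MinSup. Lowering the support threshold admits more candidate subsequences, which is the knob that trades precision against coverage in the later rule-building step.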

13 Algorithm 1: Refined
- The previous classifier performs well in precision but poorly in recall
- A second-level classifier is built on top of the existing classifier
  - New training data: cases covered by the default rule in the first classifier
  - Apply the same pattern-mining and classifier-building process
- A future case is first matched against the 1st classifier; if it is classified as OMP, we accept it; otherwise the 2nd classifier is used
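The prediction step of the two-level scheme can be sketched as a simple cascade. The two classifiers below are hypothetical stand-ins (plain functions returning "OMP" or "non-OMP"); how the real classifiers are built from pattern rules is not shown here.

```python
def cascade_predict(sequence, first_classifier, second_classifier):
    """Accept an OMP call from the first classifier; otherwise defer to the second."""
    if first_classifier(sequence) == "OMP":
        return "OMP"
    return second_classifier(sequence)


# Hypothetical stand-in classifiers, each based on a single made-up subsequence.
def level1(seq):
    return "OMP" if "AVL" in seq else "non-OMP"


def level2(seq):
    return "OMP" if "GAV" in seq else "non-OMP"


for seq in ["TTAVL", "KGAVK", "MNQIHK"]:
    print(seq, cascade_predict(seq, level1, level2))
```

Because the second classifier only sees cases the first one did not label as OMP, it can only add OMP predictions, which is how the refinement aims to raise recall.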

14 Algorithm 2: SVM-Based Classification
- Support Vector Machines (SVM) [5]
  - Excellent performer in previous biological sequence classification
- Data needs to be transformed for SVM to be used (sequences => vectors)
  - Frequent subsequences of OMPs are used as features
  - Protein sequences are mapped into binary vectors
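The sequence-to-vector transformation can be sketched as follows, assuming each vector component simply records whether the corresponding frequent subsequence occurs in the protein. The function name, the feature list, and the toy sequences are made up for illustration.

```python
def to_binary_vector(sequence, features):
    """Map a protein sequence to a 0/1 vector: component i is 1 if
    features[i] occurs in the sequence as a contiguous substring."""
    return [1 if f in sequence else 0 for f in features]


# Hypothetical feature list (in practice, the mined frequent subsequences).
features = ["AVL", "GAV", "TTA"]
vectors = [to_binary_vector(s, features) for s in ["GAVLK", "TTAVL", "MNQIHK"]]
print(vectors)
```

The feature order must be fixed once and reused for training and test sequences, so that each vector position always refers to the same subsequence. The resulting vectors can then be handed to an SVM package.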

15 5. Empirical Studies
5 classification methods:
- Single-level Rule-Based Classification (SRB)
- Refined Rule-Based Classification (RRB)
- SVM-based Classification (SVM-light [6])
- Martelli's HMM
- See5 (latest version of C4.5)
5-fold cross validation (same folding for all algorithms)
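Using the same folding for all algorithms means the fold assignment is fixed once and reused. A minimal sketch is given below, assuming a seeded shuffle of sequence indices; the seed, the function names, and the lack of stratification by class are assumptions, since the slides do not say how the folds were drawn.

```python
import random


def make_folds(n_items, k=5, seed=0):
    """Split indices 0..n_items-1 into k folds using a fixed seed,
    so every algorithm can be evaluated on identical splits."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_test_splits(folds):
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


folds = make_folds(1559)   # 1559 sequences in the dataset
for train, test in train_test_splits(folds):
    print(len(train), len(test))
```

Storing the fold assignment, or regenerating it from the same seed, lets each of the five methods be trained and tested on exactly the same partitions, so differences in the results are not due to different splits.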

16 Summary of Classifier Comparison
- SVM outperforms all other methods
- RRB is the 2nd-best performer
- Both SVM and RRB outperform HMM
- The improvement from SRB to RRB shows that the refinement works

17 Other Biological Benefits (Rule-Based Classifiers)
- Sequential rules (obtained by SRB/RRB) lead to biological insights
  - Mapped to both β-strands and periplasmic turn regions
  - Assist in developing 3D models for proteins
- Identification of primary drug target regions
  - Conserved sequences in the surface-exposed regions are ideal targets for new diagnostics and drugs

18 6. Conclusions and Future Work
Contributions
- Provided excellent predictors for OMP prediction
- Obtained interpretable sequential patterns for further biological benefits
- Proposed the use of frequent subsequences for SVM feature extraction
- Demonstrated the usefulness of data mining techniques in biological sequence analysis

19 Future Work
- Include more features of sequences, e.g., secondary structure, additional properties of proteins, barrel and turn sizes, polarity of amino acids, etc.
- Explore ways to extract symbolic information from SVMs

20 References
1. Martelli P., Fariselli P., Krogh A. and Casadio R., A sequence-profile-based HMM for predicting and discriminating β-barrel membrane proteins, Bioinformatics, 18(1), S46-S53, 2002.
2. Wang J., Chirn G., Marr T., Shapiro B., Shasha D. and Zhang K., Combinatorial Pattern Discovery for Scientific Data: Some Preliminary Results, SIGMOD-94, Minnesota, USA, 1994.
3. Wang K., Zhou S. and He Y., Growing Decision Tree on Support-less Association Rules, KDD'00, Boston, MA, USA, 2000.
4. Quinlan J., C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
5. Vapnik V., The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.
6. Joachims T., Learning to Classify Text Using Support Vector Machines, Dissertation, Kluwer, 2002. Software downloadable at http://svmlight.joachims.org/
7. Rulequest Research, Information on See5/C5.0, at http://www.rulequest.com/see5-info.html
