
1 Carnegie Mellon School of Computer Science. Copyright © 2003, Carnegie Mellon. All Rights Reserved. Biological Language Modeling Project
TXTpred: A New Method for Protein Secondary Structure Prediction
Yan Liu, Jaime Carbonell, Judith Klein-Seetharaman
School of Computer Science, Carnegie Mellon University
May 14, 2003

2 Roadmap
Overview of secondary structure prediction
Description of the TXTpred method
Experimental results and analysis
Discussion and further work

3 Secondary Structure of a Protein Sequence
The Dictionary of Secondary Structure of Proteins (DSSP) annotates each residue with its structure
 – based on hydrogen bonding patterns and geometrical constraints
7 DSSP labels for protein secondary structure (PSS):
 – Helix types: H (alpha-helix), G (3-10 helix)
 – Sheet types: B (isolated beta-bridge), E (strand)
 – Coil types: T, _, S (coil)
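The grouping of DSSP labels into helix, sheet and coil types can be sketched as a simple lookup. This is a minimal illustration, not the authors' code: the output letters H, E and C are the usual three-state convention and are assumed here, and `reduce_dssp` is a hypothetical helper name.

```python
# Three-state reduction following the grouping listed on the slide.
# The target letters H/E/C are an assumed (conventional) choice.
DSSP_TO_3STATE = {
    "H": "H", "G": "H",            # helix types
    "B": "E", "E": "E",            # sheet types
    "T": "C", "_": "C", "S": "C",  # coil types
}


def reduce_dssp(labels: str) -> str:
    """Map a string of DSSP labels to three-state labels (H, E, C)."""
    return "".join(DSSP_TO_3STATE[label] for label in labels)


print(reduce_dssp("__EEEEB_TTHHHGGS"))  # prints CCEEEEECCCHHHHHC
```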

4 Secondary Structure of a Protein Sequence
Accuracy limit: ~88%

5 Task Definition
Given a protein sequence:
 – APAFSVSPASGA
Predict its secondary structure sequence:
 – CCEEEEECCCCC
 – Focus on soluble proteins, not on membrane proteins

6 Overview of Previous Work - 1
1st-generation methods
 – Calculate propensities for each amino acid
   E.g. Chou-Fasman method (Chou & Fasman, 1974)
2nd-generation methods
 – "Window" concept: APAFSVSPAS (window size = 7)
 – Calculate propensities for segments of 3-51 amino acids
   E.g. GOR method (Garnier et al., 1978)
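The "window" concept can be illustrated with a short sketch that extracts a fixed-size window centered on each residue. The padding character used at the sequence ends is an assumption made for illustration; the slides do not say how window-based methods treat the termini, and `windows` is a hypothetical helper name.

```python
def windows(sequence: str, size: int = 7, pad: str = "-"):
    """Yield (center_index, window) for every residue in the sequence.

    The ends are padded with `pad` so that every residue gets a full
    window; this padding scheme is an assumption, not taken from the slides.
    """
    half = size // 2
    padded = pad * half + sequence + pad * half
    for i in range(len(sequence)):
        yield i, padded[i:i + size]


by_center = dict(windows("APAFSVSPAS", size=7))
print(by_center[4])  # window centered on the fifth residue: PAFSVSP
```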

7 Overview of Previous Work - 2
3rd-generation methods
 – Use evolutionary information
   Multiple sequence alignment, p-value cut-off = 10^-2
   PHD: neural network & sequence features only (Rost & Sander, 1993)
   DSC: LDA & biological features: GOR, hydrophobicity, etc. (King & Sternberg, 1996)
 – Later refinements
   Apply divergent sequence alignment: e.g. PROF (Ouali & King, 2000)
   Combine results of different systems: e.g. Jpred (Cuff & Barton, 1999)
   Bayesian segmentation (Schmidler et al., 1999)

8 Summary of Performance
Method name    Performance (Q3)
Chou-Fasman    ~50%
GOR            ~56%
PHD            ~71%
DSC            ~70%

9 Disadvantages of Previous Work
Most are "black box" predictors
 – Weak biological meaning
Little focus on long-range interactions
 – Mostly focused on local information
Performance is asymptotically bounded

10 Roadmap
Overview of secondary structure prediction
Description of the TXTpred method
Experimental results and analysis
Discussion and further work

11 TXTpred
Basic idea:
 – Build a meaningful biological vocabulary
 – Apply language techniques for prediction
Major challenge:
 – How to build the vocabulary?
Context-free N-grams of amino acids inside the window
 – Sequence: APAFSVSPAS (window = 7)
 – N-grams: P, A, ..., P; PA, AF, ..., SP; PAF, AFS, ..., VSP
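The context-free N-gram vocabulary can be sketched as follows. This is an illustrative enumeration only: the function name `ngrams` and the upper limit of three residues per N-gram (taken from the longest example on the slide) are assumptions.

```python
def ngrams(window: str, max_n: int = 3) -> list:
    """Return all contiguous N-grams of length 1..max_n inside a window."""
    grams = []
    for n in range(1, max_n + 1):
        grams.extend(window[i:i + n] for i in range(len(window) - n + 1))
    return grams


# The 7-residue window PAFSVSP taken from APAFSVSPAS.
grams = ngrams("PAFSVSP")
print(grams[:7])   # one-grams
print(grams[7:13])  # two-grams
print(grams[13:])   # three-grams
```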

12 Biological Vocabulary
Context-sensitive vocabulary
 – Analogy
   The same word might have different meanings: e.g. "bank"
   The same amino acid might have different properties: APAFSVSPAS
 – Encode context semantics into the N-gram
   Record the position information in the N-gram
   Example: APAFSVSPAS (window size = 7)
 – Words: P-3, A-2, F-1, S+0, V+1, S+2, P+3
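Position-tagged words for a window can be generated by pairing each residue with its signed offset from the window center. This is a minimal sketch for one-grams only; `positional_words` is a hypothetical name, and how the offset is attached to longer N-grams is not specified on the slide.

```python
def positional_words(window: str) -> list:
    """Tag each residue in an odd-length window with its offset from the center."""
    half = len(window) // 2
    offsets = range(-half, half + 1)
    # The "+d" format always prints the sign, so the center becomes "+0".
    return [f"{residue}{offset:+d}" for offset, residue in zip(offsets, window)]


print(positional_words("PAFSVSP"))
# prints ['P-3', 'A-2', 'F-1', 'S+0', 'V+1', 'S+2', 'P+3']
```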

13 Text Classification
Text classification
 – Analogy
   The topic of a document is expressed by the words of the document
   The structure of one residue can be inferred from the biological words nearby
 – High accuracy
 – Text classification technique
   Doc to vectors
   Classifiers: Support Vector Machines
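The "doc to vectors" step can be sketched as a bag-of-words encoding in which the words around one residue form the document. This is a simplified illustration under stated assumptions: raw counts are used as term weights, unseen words are ignored, and the helper names are hypothetical. The slides name Support Vector Machines as the classifier; training one is omitted here.

```python
def build_vocabulary(documents) -> dict:
    """Assign a column index to every distinct word, in order of first appearance."""
    vocab = {}
    for doc in documents:
        for word in doc:
            vocab.setdefault(word, len(vocab))
    return vocab


def to_vector(doc, vocab: dict) -> list:
    """Encode one document as a vector of raw word counts (assumed weighting)."""
    vector = [0] * len(vocab)
    for word in doc:
        if word in vocab:  # words outside the vocabulary are skipped
            vector[vocab[word]] += 1
    return vector


docs = [["P-3", "A-2", "F-1"], ["A-2", "F-1", "S+0"]]
vocab = build_vocabulary(docs)
print(to_vector(["A-2", "S+0", "W+1"], vocab))  # prints [0, 1, 0, 1]
```

Each residue then contributes one such vector, labeled with its secondary structure class, to the training set of the classifier.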

14 TXTpred Method
Settings:
 – Window = 17
 – One-gram, two-gram
 – Number of features = 3000

15 Evaluation Measures
Q3 (accuracy)
Precision, recall
Segment overlap quantity (SOV)
Matthews correlation coefficients
       P+   P-
  T+   p    u
  T-   o    n
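Two of these measures can be sketched in a few lines. The sketch assumes the common reading of the confusion table: p and n are correctly predicted positives and negatives, u counts residues that are underpredicted (true state missed) and o those that are overpredicted. Under that assumption the Matthews coefficient for one state is (p*n - u*o) / sqrt((p+u)(p+o)(n+u)(n+o)). Function names are hypothetical.

```python
import math


def q3(true: str, pred: str) -> float:
    """Fraction of residues whose three-state label is predicted correctly."""
    return sum(t == p for t, p in zip(true, pred)) / len(true)


def matthews(true: str, pred: str, state: str) -> float:
    """Matthews correlation coefficient for a single state (e.g. 'E')."""
    p = n = u = o = 0
    for t, q in zip(true, pred):
        if t == state and q == state:
            p += 1  # correctly predicted as the state
        elif t != state and q != state:
            n += 1  # correctly rejected
        elif t == state:
            u += 1  # underpredicted: true state missed
        else:
            o += 1  # overpredicted: state predicted wrongly
    denom = math.sqrt((p + u) * (p + o) * (n + u) * (n + o))
    return (p * n - u * o) / denom if denom else 0.0


true = "CCEEEEECCCCC"
pred = "CCCEEEECCCCC"  # hypothetical prediction with one missed strand residue
print(q3(true, pred))
print(matthews(true, pred, "E"))
```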

16 Experimental Results
RS126 datasets
CB513 datasets

17 Biological Language Properties
Power law?
Term frequency = f(rank), plotted for one-grams and two-grams
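The rank-frequency relation behind such a plot can be tabulated as follows. This is a sketch with a hypothetical helper name and toy input; under a power law, log(frequency) plotted against log(rank) would fall roughly on a straight line.

```python
from collections import Counter


def rank_frequency(sequences, n: int = 1) -> list:
    """Return (rank, term, frequency) for all N-grams, most frequent first."""
    counts = Counter(
        seq[i:i + n] for seq in sequences for i in range(len(seq) - n + 1)
    )
    return [
        (rank, term, freq)
        for rank, (term, freq) in enumerate(counts.most_common(), start=1)
    ]


# Toy input; real use would pass the training sequences.
for rank, term, freq in rank_frequency(["APAFSVSPAS"], n=1):
    print(rank, term, freq)
```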

18 Sequence Analysis - 1: Feature Selection
Top ten discriminating features for Helix
Verification by Chou-Fasman parameters
 – Helix favors A, E, M, L, K (top 5 amino acids)
 – Disfavors P (top 1 amino acid)

19 Sequence Analysis - 1: Feature Selection
Top ten discriminating features for Sheet
Verification by Chou-Fasman parameters
 – Sheet favors V, I, Y, F, W (top 5 amino acids)
 – Disfavors D, E (top 2 amino acids)

20 Sequence Analysis - 1: Feature Selection
Top ten discriminating features for Coil
Verification by Chou-Fasman parameters
 – Coil favors N, P, G, D, S (top 5 amino acids)
 – Disfavors V, I, L (top 3 amino acids)

21 Sequence Analysis - 2: Word Correlation
Word correlation
 – Some words are strongly correlated and co-occur frequently
 – Technique: Singular Value Decomposition
Examples from texts
 – Phrases: {president, Bush}
 – Semantically correlated: {Olympic, sports}

22 Sequence Analysis - 2: Word Correlation
Top ten correlated word pairs

23 Sequence Analysis - 2: Word Correlation
Regular expression   Protein sequence   Secondary structure   Conjecture
CPXXAI               Sq1: ECPNEAIM      L1: HCCCCCEC          Coil connected to Sheet
                     Sq2: ECPAEAIK      L2: HCCCCCEE
                     Sq3: GCPIPAIL      L3: CCCCCEEE
PGH                  Sq1: TFPGHSA       L1: CCCCCCC           Coil
                     Sq2: DCPGHAD       L2: ECCCHHH
EEL                  Sq1: DDEELLE       L1: CCHHHHH           Helix
                     Sq2: WSEELNS       L2: CCHHHHH
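Patterns of this kind can be checked against labeled sequences with an ordinary regular-expression search. The sketch below assumes that X stands for "any residue" and translates it to the wildcard `.`; the helper names are hypothetical.

```python
import re


def to_regex(pattern: str):
    """Compile a slide-style pattern, treating X as any single residue (assumption)."""
    return re.compile(pattern.replace("X", "."))


def find_pattern(pattern: str, sequence: str, labels: str) -> list:
    """Return (start, matched residues, aligned structure labels) for each hit."""
    return [
        (m.start(), m.group(), labels[m.start():m.end()])
        for m in to_regex(pattern).finditer(sequence)
    ]


print(find_pattern("CPXXAI", "ECPNEAIM", "HCCCCCEC"))
# prints [(1, 'CPNEAI', 'CCCCCE')]
print(find_pattern("EEL", "WSEELNS", "CCHHHHH"))
# prints [(2, 'EEL', 'HHH')]
```

Collecting the aligned labels over many sequences gives the evidence for a conjecture such as "coil connected to sheet".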

24 Conclusion
TXTpred summary
 – Context-sensitive biological vocabulary
 – Novel application of text classification to secondary structure prediction
 – Comparable performance for secondary structure prediction
 – Analysis provides reasonable biological meanings and structure indicators

25 Future Work
Deeper study on extracting a more meaningful biological vocabulary
Further discovery of new features, such as torsion angle and free energy
Advanced learning models to consider long-range interactions
 – Conditional random fields, maximum entropy Markov models

26 Acknowledgement
Vanathi Gopalakrishnan, UPitt
Ivet Bahar, UPitt

27 Motivation for 2-D Prediction
Basis for three-dimensional structure prediction
Improving other sequence and structure analysis
 – Sequence alignment
 – Threading and homology modeling
 – Experimental data
 – Protein design

