Protein Secondary Structure Prediction: A New Improved Knowledge-Based Method. Wen-Lian Hsu, Institute of Information Science, Academia Sinica, Taiwan.


1 Protein Secondary Structure Prediction: A New Improved Knowledge-Based Method. Wen-Lian Hsu, Institute of Information Science, Academia Sinica, Taiwan.

2 2/29 Outline: Introduction (PSSP, Motivation); Knowledge-Based Method (PROSP); An Improved Hybrid Method (PROSP II, HYPROSP II+); Conclusion.

3 3/29 Protein Structures. Primary sequence: MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE. Secondary structures: helices, strands, loops. Tertiary structures: three-dimensional packing of secondary structures.

4 4/29 Introduction to PSSP. Protein Secondary Structure Prediction (PSSP) aims to predict a protein's secondary structure from its amino acid sequence alone. Each amino acid is assigned a secondary structure element (SSE): Helix (H), Strand (E), or Coil (C or L).

5 5/29 Motivation. PSSP plays an important role in tertiary structure prediction. Fischer (1996) improved tertiary structure prediction accuracy from 59.0 to 71.0 by using PHD to predict SSEs. Yang (2003) improved it from 71.9 to 79.0 by using PSIPRED to predict SSEs. Predicted SSEs can also serve as features in other prediction algorithms to improve their performance.

6 6/29 Outline: Introduction (PSSP, Motivation); Knowledge-Based Method (PROSP); An Improved Hybrid Method (PROSP II, HYPROSP II+); Conclusion.

7 7/29 Treat PSSP as a Translation Problem. Secondary structure prediction translates a language with a 20-letter alphabet (the amino acids) into a language with a 3-letter alphabet (the structure elements).

8 8/29 Treating Genomic/Proteomic Sequences as a Language. For proteomic data, the units are amino acid, motif, and protein; the language units are alphabet, word, sentence, and paragraph. Protein structure or function corresponds to sentence meaning. Finding the interrelationships of data corresponds to data mining and knowledge discovery.

9 9/29

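10 Speech Recognition: Example. Sense disambiguation in English. Selection of homonyms (or senses) in speech recognition: the spoken sentence 台北市一位小孩走失了 ("A child in Taipei City has gone missing") has to be assembled from competing candidate words, including 台北市 (Taipei City), 小孩 (child), 台北 (Taipei), 適宜 (suitable), 走失 (to go missing), 事宜 (matters), 一位 (one person), 一味 (blindly), and 移位 (to shift position).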

11 11/29 How do we represent the context in a protein sequence (or sentence)? Using motifs as words? Motifs can be too specific and do not provide enough coverage. What about using k-mers? We can build (k-mer, structure) pairs. How many k-mers can we get? How do we define similar k-mers (in context)? How do we combine the structural information from the k-mers?
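
The (k-mer, structure) pairs mentioned on this slide can be illustrated with a sliding window. This is a minimal sketch, assuming the sequence and its secondary-structure string have the same length; the function name `kmer_structure_pairs` and the toy structure string are hypothetical and not taken from the talk.

```python
def kmer_structure_pairs(sequence, structure, k):
    """Return every (k-mer, structure fragment) pair from two aligned strings."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    if k <= 0:
        raise ValueError("k must be positive")
    # A string of length n yields n - k + 1 overlapping windows (none if n < k).
    return [
        (sequence[i:i + k], structure[i:i + k])
        for i in range(len(sequence) - k + 1)
    ]


# Toy input: the structure string is invented for illustration only.
pairs = kmer_structure_pairs("MTYKLILNGK", "CEEEEEECCC", k=7)
```

With 20 amino acids there are at most 20^k distinct k-mers, so longer windows are more specific but are matched less often.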

12 12/29 PROSP: our knowledge-based method for PSSP. (1) Construct a peptide Sequence-Structure Knowledge Base (SSKB). (2) Use PSI-BLAST to find all peptides similar to those of the target protein. (3) Use the similar peptides found in the SSKB to vote for the dominant structure of each amino acid in the target protein.

13 13/29 Using PSI-BLAST to Amplify the Effect of the DSSP Database (create more synonyms). The number of peptide words is still small (~5 million). Identify similar peptides: for each protein p in the NR database, apply PSI-BLAST to find its HSPs (high-scoring segment pairs). An HSP is an alignment of a subsequence of protein p with a subsequence of another protein q of unknown structure. Assign the structure of "selected" peptides of p to those of q. These peptides make up our dictionary (~100 million).

14 14/29 SSKB construction (synonyms). An example of a High-scoring Segment Pair (HSP) from a PSI-BLAST search result, pairing a segment of known structure with a segment of unknown structure.
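
Transferring structure across an HSP can be sketched as follows. This is only an illustration under two assumptions that the slides do not state: the HSP is ungapped (so position i on one side aligns with position i on the other), and every window is kept, whereas the talk applies an unspecified selection rule. The function name `transfer_structure` is hypothetical.

```python
def transfer_structure(known_structure, unknown_peptide, window):
    """Pair each window of the unknown-side peptide with the aligned
    structure fragment from the known side of an ungapped HSP."""
    if len(known_structure) != len(unknown_peptide):
        raise ValueError("ungapped HSP sides must have equal length")
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        (unknown_peptide[i:i + window], known_structure[i:i + window])
        for i in range(len(unknown_peptide) - window + 1)
    ]


# Toy HSP: the structure string is invented for illustration only.
records = transfer_structure("CCHHHHHC", "KPYQCQYA", window=7)
```

Each returned record is a candidate knowledge-base entry: a peptide from the protein of unknown structure labeled with the structure of its aligned partner.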

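15 15/29 Prediction at a position x. Peptides found through PSI-BLAST and the SSKB cast structure votes for x (H, H, H, C, E, C in the example), which are tallied into the voting scores H(x), E(x), and C(x). In the example, x is assigned as helix.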
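
The per-position vote can be sketched as a simple tally. This sketch assumes unweighted votes (one per matched peptide per position) and breaks ties in H, E, C order; the weighting actually used by PROSP is not given in the slides, and the names `vote` and `predict` are hypothetical.

```python
from collections import Counter


def vote(matches, length):
    """Tally H/E/C votes per target position.

    `matches` is a list of (start, structure) pairs: a knowledge-base
    peptide aligned to the target at 0-based offset `start`.
    """
    scores = [Counter() for _ in range(length)]
    for start, structure in matches:
        for offset, state in enumerate(structure):
            position = start + offset
            if 0 <= position < length:
                scores[position][state] += 1
    return scores


def predict(scores, default="C"):
    """Pick the top-scoring state per position; positions with no votes
    get `default`, and ties resolve in H, E, C order."""
    result = []
    for counts in scores:
        if not counts:
            result.append(default)
        else:
            result.append(max("HEC", key=lambda state: counts[state]))
    return "".join(result)


# Toy matches, invented for illustration only.
scores = vote([(0, "HHHC"), (0, "HHHH"), (1, "ECC")], length=5)
prediction = predict(scores)
```

Falling back to coil for positions without votes is a design choice of this sketch, not something stated in the talk.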

16 16/29 Outline: Introduction (PSSP, Motivation); Knowledge-Based Method (PROSP); An Improved Hybrid Method (PROSP II, HYPROSP II+); Conclusion.

17 17/29 Two problems of searching for homologous peptides in protein sequence databases. (1) Redundant information generated by duplicate peptides: the voting bias problem in PROSP. (2) Poor prediction accuracy due to insufficient knowledge-base matching: boost coverage.

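18 18/29 The voting bias problem. Figure: PSI-BLAST results aligning the query (KTYQCQY …) with subject peptides (KPYQCQY, KVYQCQY, QPYRCKY); the corresponding SSKB entries carry structures such as HHHHHH and CCHHHC, with one structure marked as the dominant result.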

19 19/29 Clustering HSPs. Figure: similar HSPs (MYSKIL L, MYKKI YL, MYSSI LY) aligned against the target …MYKKILYPTDFSETAEIALK… are grouped together.
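
One way to keep near-duplicate peptides from dominating the vote is to group them before counting. The slides do not specify the clustering method, so this sketch swaps in a simple greedy clustering by Hamming distance; `hamming`, `greedy_cluster`, and the `max_mismatches` parameter are hypothetical names and settings.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length peptides."""
    if len(a) != len(b):
        raise ValueError("peptides must have the same length")
    return sum(x != y for x, y in zip(a, b))


def greedy_cluster(peptides, max_mismatches=1):
    """Assign each peptide to the first cluster whose representative
    (its first member) is within `max_mismatches`; else start a new one."""
    clusters = []
    for peptide in peptides:
        for cluster in clusters:
            if hamming(peptide, cluster[0]) <= max_mismatches:
                cluster.append(peptide)
                break
        else:
            clusters.append([peptide])
    return clusters


# Subject peptides from the voting bias slide.
clusters = greedy_cluster(["KPYQCQY", "KVYQCQY", "QPYRCKY"])
```

If each cluster then contributes a single vote, a family of nearly identical hits counts once rather than once per member.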

20 20/29 Measuring the amount of structural information. Figure: HSPs along the target protein, with positions marked Found or Unfound. In a region with a low local match rate there is no information from SSKB 7.

21 21/29 Construct SSKBs with different window lengths (to boost coverage). The same pipeline is run once with window length = 7 and once with window length = 5: training protein, PSI-BLAST search, HSPs, SSKB construction, SSKB.

22 22/29 Boost match rate using peptide records of different lengths. Protein: MYKKILYPTDFSETAEIALK … HSPs from SSKB 7 (window length = 7) and HSPs from SSKB 5 (window length = 5) each give per-position scores. The slide shows two example score tables:
H 1 2 1 3 6 7 8 …
E 1 2 2 0 0 0 1 …
C 2 3 8 8 5 4 2 …
and
H 1 3 2 5 5 5 2 …
E 1 3 2 0 0 0 1 …
C 2 4 7 7 6 6 7 …

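23 23/29 New PROSP system. Protein: MYKKILYPTDFSETAEIALK … The per-position scores from the two SSKBs (window length = 7 and window length = 5) are shown as two example tables:
H 1 2 1 3 6 7 8 …
E 1 2 2 0 0 0 1 …
C 2 3 8 8 5 4 2 …
and
H 1 3 2 5 5 5 2 …
E 1 3 2 0 0 0 1 …
C 2 4 7 7 6 6 7 …
They are combined, weighted by the local match rate $\mathrm{LMR}_{7\mathrm{mer}}(x)$:
$H_{\mathrm{PROSPII}}(x) \leftarrow \mathrm{LMR}_{7\mathrm{mer}}(x) \times H_7(x) + (1 - \mathrm{LMR}_{7\mathrm{mer}}(x)) \times H_5(x)$
$E_{\mathrm{PROSPII}}(x) \leftarrow \mathrm{LMR}_{7\mathrm{mer}}(x) \times E_7(x) + (1 - \mathrm{LMR}_{7\mathrm{mer}}(x)) \times E_5(x)$
$C_{\mathrm{PROSPII}}(x) \leftarrow \mathrm{LMR}_{7\mathrm{mer}}(x) \times C_7(x) + (1 - \mathrm{LMR}_{7\mathrm{mer}}(x)) \times C_5(x)$
The combined example table on the slide is:
H 1 3 2 5 7 6 7 …
E 1 3 2 0 0 0 1 …
C 2 4 8 8 4 5 6 …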
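
The weighted combination on this slide can be written as a short function applied to one structure state at a time. This is a minimal sketch, assuming the local match rate is expressed as a fraction between 0 and 1; the function name `combine_scores` and the example weights are hypothetical.

```python
def combine_scores(lmr7, score7, score5):
    """Blend per-position scores: lmr * score7 + (1 - lmr) * score5."""
    if not (len(lmr7) == len(score7) == len(score5)):
        raise ValueError("all inputs must have the same length")
    for weight in lmr7:
        if not 0.0 <= weight <= 1.0:
            raise ValueError("local match rate must lie in [0, 1]")
    return [
        weight * s7 + (1.0 - weight) * s5
        for weight, s7, s5 in zip(lmr7, score7, score5)
    ]


# Helix scores for three positions; the weights are invented for illustration.
h_combined = combine_scores([1.0, 0.5, 0.0], [6, 7, 8], [5, 5, 2])
```

Where the 7-mer knowledge base matches well (weight near 1) its scores dominate; where it has little to say (weight near 0) the 5-mer scores take over.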

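24 24/29 Hybrid by Neural Network. The query protein is processed by PSIPRED, PROSP, and PSI-BLAST. PSIPRED and PROSP each supply an H score, an E score, and a C score (3 features), and PSI-BLAST supplies the PSSM (20 features). A neural network combines these inputs into the final result.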
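
Assembling the network input for one residue might look like the following. This sketch assumes a plain concatenation of the three PSIPRED scores, the three PROSP scores, and the 20-value PSSM row; the actual input layout (for example, any window over neighboring residues) is not described in the slides, and `residue_features` is a hypothetical name.

```python
def residue_features(psipred_scores, prosp_scores, pssm_row):
    """Concatenate per-residue inputs into one flat feature list."""
    if len(psipred_scores) != 3 or len(prosp_scores) != 3:
        raise ValueError("expected H, E and C scores from each predictor")
    if len(pssm_row) != 20:
        raise ValueError("expected one PSSM value per amino acid type")
    return list(psipred_scores) + list(prosp_scores) + list(pssm_row)


# All values are invented for illustration only.
features = residue_features([0.7, 0.2, 0.1], [0.6, 0.1, 0.3], [0] * 20)
```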

25 25/29 Data Sets. Two widely used test sets: CB513 and EVAc4. Derivation of the training sets: take 4,572 unique protein chains (with less than 25% mutual sequence identity) from the DSSP database, then remove the chains with more than 25% sequence identity to the respective test set. The final training sets contain 4,395 protein chains for EVAc4 and 4,055 for CB513.
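
The filtering step can be sketched as below. The identity measure is passed in as a function because the slides do not say how sequence identity was computed; `toy_identity` is a deliberately naive stand-in (position-by-position comparison without alignment), and all names here are hypothetical.

```python
def remove_similar_to_test(candidates, test_set, identity, threshold=0.25):
    """Keep candidates whose identity to every test chain is <= threshold."""
    return [
        chain for chain in candidates
        if all(identity(chain, test_chain) <= threshold
               for test_chain in test_set)
    ]


def toy_identity(a, b):
    """Illustrative stand-in: fraction of matching positions, no alignment."""
    if not a or not b:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


# Toy sequences, invented for illustration only.
kept = remove_similar_to_test(["AAAA", "KLMN"], ["AAAC"], toy_identity)
```

In practice the identity function would come from a sequence alignment tool; only the thresholding logic is shown here.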

26 26/29 The respective performance improvement using SSKB 5 and SSKB 7. Figure (axes: LMR 7mer (%) and Q3 (%)): prediction performance on CB513 of SSKB 5, SSKB 7, and PROSP II for LMR 7mer below 50%.
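
Q3, the accuracy measure on this slide's axis, is the percentage of residues whose three-state label (H, E, or C) is predicted correctly. A minimal sketch, with a hypothetical function name and invented example strings:

```python
def q3(predicted, observed):
    """Percentage of positions where the predicted state equals the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("cannot score an empty sequence")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Five of six positions agree in this toy example.
score = q3("HHHCEC", "HHHCCC")
```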

27 27/29 Performance of HYPROSP II+

28 28/29 Conclusion. HYPROSP II+ uses a more robust knowledge-based algorithm, PROSP II. More structural information gives better prediction. Incremental learning. The general strategy developed in this paper could be used to enhance the performance of similar approaches in other prediction problems.

29 People: Wen-Lian Hsu, Ting-Yi Sung, Hsin-Nan Lin, Jia-Ming Chang, Ei-Wen Yang.

