A Hidden Markov Model for Protein Secondary Structure Prediction. Wei-Mou Zheng, Institute of Theoretical Physics, Academia Sinica, PO Box 2735, Beijing 100080. zheng@itp.ac.cn
Outline: Protein structure; A brief review of secondary structure prediction; Hidden Markov model: simple-minded; Hidden Markov model: realistic; Discussion; References
Protein sequences are written in 20 letters (the 20 naturally occurring amino acid residues): AVCDE FGHIW KLMNY PQRST. Residue classes in the legend: hydrophobic, charged (+/-), polar.
Residues form a directed chain (cis- and trans- conformations).
Rasmol ribbon diagram of GB1: helix (pink), sheets (yellow) and coil (grey). Hydrogen-bond network. 3D structure → secondary structure written in three letters: H, E, C. H : E : C = 34.9 : 21.8 : 43.3
Bayes formula. Generally, P(x, y) = P(x|y) P(y) = P(y|x) P(x), hence P(y|x) = P(x|y) P(y) / P(x).
Protein sequence A = {a_i}, i = 1, 2, …, n. Secondary structure sequence S = {s_i}, i = 1, 2, …, n. Secondary structure prediction: 1D amino acid sequence → 1D secondary structure sequence. A problem that has been studied for more than 30 years. Inference of S from A: P(S|A). 1. Simple Chou-Fasman approach: the Chou-Fasman propensity of each amino acid for each conformational state, plus an independence approximation.
Parameter training. Propensities q(a,s). Counts (20x3) from a database: N(a,s); sum over a → N(s), sum over s → N(a), sum over a and s → N. q(a,s) = [N(a,s) N] / [N(a) N(s)].
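The propensity formula above can be sketched in a few lines of Python. This is only an illustration: the toy sequence, its structure string, and the function names `propensities` and `predict_independent` are made up here and are not taken from the talk or from any database. The predictor picks, for each residue independently, the state with the largest propensity, which is one simple way to read the independence approximation.

```python
from collections import Counter

def propensities(pairs):
    """Return q(a,s) = N(a,s) * N / (N(a) * N(s)) from (residue, state) pairs."""
    n_as = Counter(pairs)                    # N(a,s)
    n_a = Counter(a for a, _ in pairs)       # N(a): sum over s
    n_s = Counter(s for _, s in pairs)       # N(s): sum over a
    n = len(pairs)                           # N: sum over a and s
    return {(a, s): n_as[(a, s)] * n / (n_a[a] * n_s[s]) for (a, s) in n_as}

def predict_independent(seq, q, states="HEC"):
    """Assign each residue the state with the largest propensity (unseen pairs count as 0)."""
    return "".join(max(states, key=lambda s: q.get((a, s), 0.0)) for a in seq)

# Hypothetical toy data, for illustration only.
toy_seq = "AAELLKVVIVGG"
toy_str = "HHHHHHEEEECC"
q = propensities(list(zip(toy_seq, toy_str)))
print(q[("A", "H")], predict_independent("AVG", q))
```

A propensity above 1 means the residue occurs in that state more often than expected if residue and state were independent.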
2. Garnier-Osguthorpe-Robson (GOR) window version: conditional independence, weight matrix (20x17)x3, P(W|s). 3. Improved GOR (20x20x16x3, to include pair correlation).
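As a rough sketch of the window idea, the code below estimates a (20x17)x3 table of position-specific residue frequencies and scores a window as log P(W|s), the sum of log P(a_{i+k} | s, k) over the 17 offsets k, which is what conditional independence gives. The pseudocount, the truncation of windows at chain ends, the training data and all function names are assumptions made for this example. The original GOR method works with information values rather than this plain likelihood.

```python
import math
from collections import defaultdict

AMINO = "ACDEFGHIKLMNPQRSTVWY"
STATES = "HEC"
HALF = 8  # 8 residues on each side of the center: a 17-residue window

def train_window_weights(seqs, strs, half=HALF, pseudo=1.0):
    """Count residues at each offset k around sites in state s; return log P(a | s, k)."""
    counts = defaultdict(float)   # (s, k, a) -> count
    totals = defaultdict(float)   # (s, k)    -> count
    for seq, st in zip(seqs, strs):
        for i, s in enumerate(st):
            for k in range(-half, half + 1):
                j = i + k
                if 0 <= j < len(seq):
                    counts[(s, k, seq[j])] += 1.0
                    totals[(s, k)] += 1.0

    def log_p(a, k, s):
        # Pseudocounts keep unseen residues at a nonzero probability (an assumption).
        num = counts.get((s, k, a), 0.0) + pseudo
        den = totals.get((s, k), 0.0) + pseudo * len(AMINO)
        return math.log(num / den)

    return log_p

def window_log_score(seq, i, s, log_p, half=HALF):
    """log P(W|s) under conditional independence; offsets past the chain ends are skipped."""
    return sum(log_p(seq[i + k], k, s)
               for k in range(-half, half + 1) if 0 <= i + k < len(seq))

def predict(seq, log_p, half=HALF):
    """Pick the best-scoring state at each site."""
    return "".join(max(STATES, key=lambda s: window_log_score(seq, i, s, log_p, half))
                   for i in range(len(seq)))

# Hypothetical one-chain training set, for illustration only.
log_p = train_window_weights(["AAAAAAGGGGGG"], ["HHHHHHCCCCCC"])
prediction = predict("AAAAAAAA", log_p)
print(prediction)
```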
Hidden Markov Model (HMM): simple-minded. Bayesian formula: P(S|A) = P(S,A)/P(A) ∝ P(S,A) = P(A|S) P(S). Simple version: the hidden sequence S is a Markov chain, and state s_i emits a_i according to P(a|s). Forward and backward functions. Diagram: hidden chain s_1 → s_2 → s_3 emitting a_1, a_2, a_3.
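To make the Bayesian formula concrete, the sketch below enumerates every hidden sequence S for a very short A, computes the joint P(S,A) = P(A|S) P(S) as a product of transition and emission factors, and normalizes by P(A). The three-letter residue alphabet and all probabilities in `INIT`, `TRANS` and `EMIT` are invented for illustration. Enumeration costs 3^n and is only feasible for tiny n.

```python
import itertools

STATES = "HEC"
# Hypothetical parameters over a toy residue alphabet {A, V, G}.
INIT = {"H": 0.35, "E": 0.22, "C": 0.43}
TRANS = {"H": {"H": 0.80, "E": 0.05, "C": 0.15},
         "E": {"H": 0.05, "E": 0.70, "C": 0.25},
         "C": {"H": 0.15, "E": 0.15, "C": 0.70}}
EMIT = {"H": {"A": 0.6, "V": 0.2, "G": 0.2},
        "E": {"A": 0.2, "V": 0.6, "G": 0.2},
        "C": {"A": 0.2, "V": 0.2, "G": 0.6}}

def joint(S, A, init, trans, emit):
    """P(S,A): initial state, then one transition and one emission per step."""
    p = init[S[0]] * emit[S[0]][A[0]]
    for prev, cur, a in zip(S, S[1:], A[1:]):
        p *= trans[prev][cur] * emit[cur][a]
    return p

def posterior_by_enumeration(A, init, trans, emit):
    """Return ({S: P(S|A)}, P(A)) by summing the joint over all 3^n hidden sequences."""
    joints = {"".join(S): joint(S, A, init, trans, emit)
              for S in itertools.product(STATES, repeat=len(A))}
    p_a = sum(joints.values())
    return {S: p / p_a for S, p in joints.items()}, p_a

post, p_a = posterior_by_enumeration("AVG", INIT, TRANS, EMIT)
best = max(post, key=post.get)
print(best, post[best], p_a)
```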
Initial conditions and recursion relations. Partition function. Linear algorithm: dynamic programming, Baum-Welch (sum) & Viterbi (max).
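The max variant of the dynamic programming can be sketched as follows: a minimal log-space Viterbi pass whose result is checked against brute-force enumeration on a short sequence. The parameter tables are hypothetical toy values over a three-letter residue alphabet, not the trained model of the talk.

```python
import math
from itertools import product

STATES = "HEC"
# Hypothetical parameters over a toy residue alphabet {A, V, G}.
INIT = {"H": 0.35, "E": 0.22, "C": 0.43}
TRANS = {"H": {"H": 0.80, "E": 0.05, "C": 0.15},
         "E": {"H": 0.05, "E": 0.70, "C": 0.25},
         "C": {"H": 0.15, "E": 0.15, "C": 0.70}}
EMIT = {"H": {"A": 0.6, "V": 0.2, "G": 0.2},
        "E": {"A": 0.2, "V": 0.6, "G": 0.2},
        "C": {"A": 0.2, "V": 0.2, "G": 0.6}}

def log_joint(path, seq):
    """log P(S,A) for one hidden path."""
    lp = math.log(INIT[path[0]]) + math.log(EMIT[path[0]][seq[0]])
    for prev, cur, a in zip(path, path[1:], seq[1:]):
        lp += math.log(TRANS[prev][cur]) + math.log(EMIT[cur][a])
    return lp

def viterbi(seq):
    """Return the most probable hidden path and its log joint probability."""
    score = {s: math.log(INIT[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][cur] = best predecessor of state cur at position i + 1
    for a in seq[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][cur]))
            pointers[cur] = best_prev
            new_score[cur] = (score[best_prev] + math.log(TRANS[best_prev][cur])
                              + math.log(EMIT[cur][a]))
        back.append(pointers)
        score = new_score
    last = max(STATES, key=score.get)
    path = [last]
    for pointers in reversed(back):   # trace the stored pointers back to position 0
        path.append(pointers[path[-1]])
    return "".join(reversed(path)), score[last]

seq = "AAVVG"
best_path, best_lp = viterbi(seq)
brute_lp = max(log_joint("".join(p), seq) for p in product(STATES, repeat=len(seq)))
print(best_path, best_lp, brute_lp)
```

The work grows linearly with sequence length (times the square of the number of states), whereas the brute-force check grows as 3^n.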
Prob(s_i = s, s_{i+1} = s') = A_i(s) t_{ss'} P(a_{i+1}|s') B_{i+1}(s') / Z. Prob(s_{i:j})
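A minimal sketch of the sum variant follows, using conventions chosen to agree with the pair formula above: the forward function A_i(s) includes the emission at site i, the backward function B_i(s) covers sites i+1 to n with B_n(s) = 1, and Z is the sum over s of A_n(s). These conventions, the toy parameters and the function names are assumptions. No rescaling is applied, so this version would underflow on long chains.

```python
STATES = "HEC"
# Hypothetical parameters over a toy residue alphabet {A, V, G}.
INIT = {"H": 0.35, "E": 0.22, "C": 0.43}
TRANS = {"H": {"H": 0.80, "E": 0.05, "C": 0.15},
         "E": {"H": 0.05, "E": 0.70, "C": 0.25},
         "C": {"H": 0.15, "E": 0.15, "C": 0.70}}
EMIT = {"H": {"A": 0.6, "V": 0.2, "G": 0.2},
        "E": {"A": 0.2, "V": 0.6, "G": 0.2},
        "C": {"A": 0.2, "V": 0.2, "G": 0.6}}

def forward_backward(seq, init, trans, emit):
    """Return forward tables A, backward tables B and the partition function Z = P(A)."""
    n = len(seq)
    A = [dict() for _ in range(n)]
    B = [dict() for _ in range(n)]
    for s in STATES:                      # initial conditions
        A[0][s] = init[s] * emit[s][seq[0]]
        B[n - 1][s] = 1.0
    for i in range(1, n):                 # forward recursion
        for s2 in STATES:
            A[i][s2] = sum(A[i - 1][s] * trans[s][s2] for s in STATES) * emit[s2][seq[i]]
    for i in range(n - 2, -1, -1):        # backward recursion
        for s in STATES:
            B[i][s] = sum(trans[s][s2] * emit[s2][seq[i + 1]] * B[i + 1][s2]
                          for s2 in STATES)
    Z = sum(A[n - 1][s] for s in STATES)
    return A, B, Z

def site_posterior(A, B, Z, i):
    """Prob(s_i = s) = A_i(s) B_i(s) / Z."""
    return {s: A[i][s] * B[i][s] / Z for s in STATES}

def pair_posterior(seq, A, B, Z, trans, emit, i):
    """Prob(s_i = s, s_{i+1} = s') = A_i(s) t_{ss'} P(a_{i+1}|s') B_{i+1}(s') / Z."""
    return {(s, s2): A[i][s] * trans[s][s2] * emit[s2][seq[i + 1]] * B[i + 1][s2] / Z
            for s in STATES for s2 in STATES}

seq = "AAVVG"
A, B, Z = forward_backward(seq, INIT, TRANS, EMIT)
site = site_posterior(A, B, Z, 1)
pair = pair_posterior(seq, A, B, Z, TRANS, EMIT, 1)
print(Z, site, max(pair, key=pair.get))
```

The site posteriors give a per-residue confidence, and summing the pair posterior over s' recovers the site posterior at i.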
Hidden Markov Model: realistic. 1) Strong correlation in conformational states: at least two consecutive E and three consecutive H; refined conformational states (243 → 75). 2) Emission probabilities → improved window scores. Proportion of accurately predicted sites ~ 70% (compared with < 65% for prediction based on a single sequence). No post-prediction filtering. Integrated (overall) estimation of refined conformation states. Measure of prediction confidence.
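One reading of "243 → 75" is that the refined states are windows of five consecutive conformations (3^5 = 243), with every window removed in which an H run shorter than three or an E run shorter than two lies strictly inside the window. Runs touching either end may continue outside and are left unconstrained. This interpretation is an assumption rather than something stated on the slide, but the enumeration below shows that it leaves 75 windows.

```python
from itertools import groupby, product

# Minimal run lengths from the slide: at least three H, at least two E; C is unconstrained.
MIN_RUN = {"H": 3, "E": 2, "C": 1}

def is_allowed(window):
    """Reject a window if a run lying strictly inside it is shorter than its minimum."""
    pos = 0
    for state, group in groupby(window):
        length = len(list(group))
        start, end = pos, pos + length - 1
        pos += length
        interior = start > 0 and end < len(window) - 1
        if interior and length < MIN_RUN[state]:
            return False
    return True

all_windows = ["".join(w) for w in product("HEC", repeat=5)]
allowed = [w for w in all_windows if is_allowed(w)]
print(len(all_windows), len(allowed))  # 243 75
```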
Discussion. An HMM using refined conformational states and window scores is efficient for protein secondary structure prediction. A better scoring system should cover more of the correlation between conformation and sequence. Combining homologous information will improve the prediction accuracy. From secondary structure to 3D structure (structure codes: discretized 3D conformational states).
References. Lawrence R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE 77 (1989) 257-286. Burkhard Rost, Protein secondary structure prediction continues to rise, Journal of Structural Biology 134 (2001) 204–218.
The End
Figure: Venn diagram of amino acid properties, with regions labeled small, tiny, aliphatic, aromatic, hydrophobic, polar, positive and negative.