RNA Secondary Structure Prediction Using Stochastic Context-Free Grammar and Evolutionary History
B. Knudsen and J. Hein, Department of Genetics and Ecology, The Institute of Biological Sciences, University of Aarhus, Denmark
Presented by Jing Cui, Nov. 22, 2002
Outline of the Lecture
Introduction
Algorithms
  The grammar
  Probabilities of columns
  Probabilities of an alignment
  The full model
Implementation
  The database
  Frequencies
  Mutation rates
  Grammar parameters
Results
  The test sequences
  Using related sequences
  Neglecting phylogeny
  Weight of results
  Comparison with other methods
Conclusion
  The limitations
  The improvements
Introduction
Single sequence: e.g. Zuker (1989), using prior information on RNA structures through energy functions. Not ideal when estimating structures of sequences with known homologs.
Multiple sequences:
1. Covariance methods (Eddy and Durbin, 1994)
2. Profile stochastic context-free grammars (SCFGs; Sakakibara et al., 1994)
Characteristics: these methods do not explicitly take phylogeny into account and do not use a prior probability distribution of structures.
Maximum weighted matching methods (Cary and Stormo, 1995; Tabaska et al., 1998) share the above characteristics.
The method used in this paper uses prior knowledge about RNA structure to make a maximum a posteriori (MAP) estimate of the secondary structure.
It is performed on an alignment of sequences assumed to have identical secondary structure, i.e. the alignment is assumed to be a structural alignment.
It takes the phylogenetic tree of the sequences into account, including branch lengths, using a model of mutation processes in RNA. The tree can be estimated by a maximum likelihood (ML) method.
Origin: Goldman et al. (1996), predicting protein secondary structure using HMMs that include phylogenetic information.
Difference: secondary structure interactions in RNA are not local, as they are in proteins, so SCFGs are used here instead of HMMs.
Limitation: SCFGs are unable to model crossing interactions, thus pseudoknots cannot be predicted.
Algorithm
Input: an alignment of RNA sequences
Output: a single common structure for the sequences
The model: the SCFG and the evolutionary model
The grammar
A set of variables, some terminal and some non-terminal.
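The production rules themselves are not in this transcript. Assuming the slide showed the grammar from the Knudsen and Hein paper, it is usually written with the non-terminals S, L and F and the terminals s (a single, unpaired column) and d (one side of a pair):

$$
\begin{aligned}
S &\rightarrow LS \mid L \\
F &\rightarrow dFd \mid LS \\
L &\rightarrow s \mid dFd
\end{aligned}
$$

In every rule of the form dFd, the two d symbols are the left and right columns of one base pair. Because F must eventually expand to LS, and S produces at least one L, every innermost pair encloses at least two elements.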
Probabilities of columns
Given the tree, a column of non-pairing bases is assumed to be independent of the other columns, and two pairing columns are assumed to be independent of any other columns.
P = (pA, pU, pG, pC) is the distribution of bases in loop regions of RNA sequences.
Single bases evolve according to a 4 by 4 rate matrix; base pairs evolve according to a 16 by 16 rate matrix.
Given a tree, including branch lengths, the column probabilities are calculated using post-order traversal as described by Felsenstein (1981).
Mutations are assumed to be reversible.
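The post-order calculation for a single unpaired column can be sketched roughly as follows. This is a minimal illustration, not the presented implementation: it swaps in the Jukes-Cantor model (closed-form transition probabilities, uniform base frequencies) in place of the estimated rate matrix, and the tree encoding, `jc_transition`, `partial_likelihoods` and `column_probability` are all hypothetical names introduced here.

```python
import math

BASES = "AUGC"

def jc_transition(t):
    """Jukes-Cantor transition probabilities for branch length t.
    Assumption: stands in for the estimated rate matrix of the talk."""
    decay = math.exp(-4.0 * t / 3.0)
    same = 0.25 + 0.75 * decay
    diff = 0.25 - 0.25 * decay
    return {a: {b: (same if a == b else diff) for b in BASES} for a in BASES}

def partial_likelihoods(node, column):
    """Post-order traversal. Returns, for each base x,
    P(observed leaves below node | node carries base x).
    A leaf is a sequence name (str); an internal node is a list of
    (child, branch_length) pairs (hypothetical encoding)."""
    if isinstance(node, str):
        observed = column[node]
        return {b: 1.0 if b == observed else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, branch_length in node:
        child_like = partial_likelihoods(child, column)  # children first
        trans = jc_transition(branch_length)
        for a in BASES:
            result[a] *= sum(trans[a][b] * child_like[b] for b in BASES)
    return result

def column_probability(tree, column, base_freqs):
    """Probability of one alignment column given the tree."""
    root_like = partial_likelihoods(tree, column)
    return sum(base_freqs[b] * root_like[b] for b in BASES)

# Usage: three sequences, one column.
uniform = {b: 0.25 for b in BASES}
tree = [("seq1", 0.1), ([("seq2", 0.2), ("seq3", 0.2)], 0.05)]
p = column_probability(tree, {"seq1": "A", "seq2": "A", "seq3": "G"}, uniform)
```

A pair of columns would be handled the same way, with the 16 possible base pairs as states in place of the 4 bases.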
Probability of an alignment
The input data: D = (C_1, C_2, ..., C_l)
The model: M
The tree: T
Secondary structure: σ
s: a single base
d: the left column of a pair
d^c: the right column of the pair
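Under the independence assumptions on the column-probability slide, the probability of the alignment given a structure factorizes over single columns and column pairs. A sketch of that factorization in the symbols above (the product notation is a reconstruction, not copied from the slide):

$$
P(D \mid \sigma, T, M) \;=\; \prod_{s \in \sigma} P(C_s \mid T, M) \;\prod_{(d,\, d^c) \in \sigma} P(C_d, C_{d^c} \mid T, M)
$$

The first product runs over the unpaired positions of σ and the second over its base pairs, each pair contributing one joint probability for its two columns.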
The core model
The SCFG and the evolutionary model.
The grammar is equivalent to a grammar that generates columns in alignments instead of just secondary structure, meaning that, for a two-sequence alignment, a production rule that emits a symbol covers one rule for each possible column.
The full model
If no phylogenetic tree is given, the ML estimate of the tree, given the model, is used.
The most likely secondary structure is found by MAP (maximum a posteriori) estimation using Bayes' theorem. In the Bayes expansion, P(σ|T,M) is the prior distribution of structures given by the SCFG.
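The two estimation steps can be written out as below. This is a reconstruction from the definitions on the slides, so the exact notation (for example the superscripts ML and MAP) is an assumption:

$$
T^{ML} = \arg\max_{T} P(D \mid T, M)
$$

$$
\sigma^{MAP} = \arg\max_{\sigma} P(\sigma \mid D, T^{ML}, M)
= \arg\max_{\sigma} \frac{P(D \mid \sigma, T^{ML}, M)\, P(\sigma \mid T^{ML}, M)}{P(D \mid T^{ML}, M)}
= \arg\max_{\sigma} P(D \mid \sigma, T^{ML}, M)\, P(\sigma \mid T^{ML}, M)
$$

The denominator does not depend on σ, so it can be dropped from the maximization. The first remaining factor comes from the evolutionary model and the second from the SCFG.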
Implementation: the database
The database used for estimating this model should represent RNA structure in general, so it should be composed of various types of RNA.
Sources: the tRNA database by Sprinzl et al. (1998) and large subunit ribosomal RNAs (LSU rRNAs) by De Rijk et al. (1998).
Frequencies The single base frequencies were estimated from counts of the bases in the single base positions of the sequences. Base pair frequencies were estimated by counting base pairs.
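The counting step can be illustrated with a short sketch. It assumes the annotated structure is available in dot-bracket notation and that the sequences are gap-free and aligned to it; neither detail is stated on the slide, and `pair_table` and `estimate_frequencies` are hypothetical helper names.

```python
from collections import Counter

def pair_table(structure):
    """Map each '(' position to its matching ')' position.
    Assumption: structure is a balanced dot-bracket string."""
    stack, pairs = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs[stack.pop()] = i
    return pairs

def estimate_frequencies(sequences, structure):
    """Relative frequencies of bases at unpaired positions and of
    (left, right) base pairs at paired positions."""
    pairs = pair_table(structure)
    paired_positions = set(pairs) | set(pairs.values())
    single_counts, pair_counts = Counter(), Counter()
    for seq in sequences:
        for i, base in enumerate(seq):
            if i in pairs:                        # left side of a pair
                pair_counts[(base, seq[pairs[i]])] += 1
            elif i not in paired_positions:       # unpaired position
                single_counts[base] += 1
    n_single = sum(single_counts.values())
    n_pair = sum(pair_counts.values())
    single_freq = {b: c / n_single for b, c in single_counts.items()}
    pair_freq = {p: c / n_pair for p, c in pair_counts.items()}
    return single_freq, pair_freq

# Usage: two toy hairpins sharing one structure.
single_freq, pair_freq = estimate_frequencies(["GGAAACC", "GCAAAGC"], "((...))")
```

With real training data the counts would come from the tRNA and LSU rRNA databases named on the database slide.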
Mutation rates
For a given pair of sequences, p:
t_p: the time between the sequences
N_p: the number of columns in the two-sequence alignment
P_s: the probability of a base being in a single base position
Grammar parameters
Estimated by the inside-outside algorithm (an expectation maximization procedure) on the training set of secondary structures (Baker, 1979; Lari and Young, 1990).
This is just like the forward-backward algorithm for HMMs!
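The inside half of that procedure can be sketched for a single sequence as follows. The grammar is assumed to be S → LS | L, F → dFd | LS, L → s | dFd (the form usually given for the Knudsen and Hein paper), and every number in `RULES`, `SINGLE` and `PAIR` is an illustrative placeholder, not an estimated parameter. The function `inside` is a hypothetical name.

```python
# Hypothetical rule probabilities; the two rules of each non-terminal sum to 1.
RULES = {"S->LS": 0.85, "S->L": 0.15,
         "F->dFd": 0.80, "F->LS": 0.20,
         "L->s": 0.90, "L->dFd": 0.10}
# Hypothetical emission probabilities for single bases and base pairs.
SINGLE = {"A": 0.25, "U": 0.25, "G": 0.25, "C": 0.25}
PAIR = {("G", "C"): 0.25, ("C", "G"): 0.25, ("A", "U"): 0.20,
        ("U", "A"): 0.20, ("G", "U"): 0.05, ("U", "G"): 0.05}

def inside(seq):
    """Total probability that S generates seq, summed over all structures.
    Tables are indexed [i][j] for the half-open span seq[i:j]."""
    n = len(seq)
    S = [[0.0] * (n + 1) for _ in range(n + 1)]
    L = [[0.0] * (n + 1) for _ in range(n + 1)]
    F = [[0.0] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):            # shorter spans first
        for i in range(n - length + 1):
            j = i + length
            # Outer bases pair up and F generates everything in between.
            stem = 0.0
            if length >= 2:
                stem = PAIR.get((seq[i], seq[j - 1]), 0.0) * F[i + 1][j - 1]
            # L generates seq[i:k] and S generates seq[k:j].
            split = sum(L[i][k] * S[k][j] for k in range(i + 1, j))
            if length == 1:
                L[i][j] = RULES["L->s"] * SINGLE[seq[i]]
            else:
                L[i][j] = RULES["L->dFd"] * stem
            F[i][j] = RULES["F->dFd"] * stem + RULES["F->LS"] * split
            S[i][j] = RULES["S->L"] * L[i][j] + RULES["S->LS"] * split
    return S[0][n]

# Usage: probability of a short sequence under the toy parameters.
prob = inside("GGAAACC")
```

The three nested loops (span length, start position, split point) are where the cubic running time of this kind of dynamic programming comes from. In the alignment setting, the single-base and pair emission terms would be replaced by column probabilities computed on the tree.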
Results: the test sequences
Four bacterial RNase P RNA sequences. Alignment: 385 columns. Pairwise sequence identities: 65-92%.
Pseudoknot and ; and At least 22 positions wrongly predicted in each sequence
Using related sequences
Weight of results
Using the inside and outside variables, calculate the probability that each position is correctly predicted. This shows how certain the predictions are, assuming that the model is correct.
Comparison with other methods
The energy minimization method has more parameters and gives better results.
COVE (Eddy and Durbin, 1994) has lower accuracy.
This shows the significance of the method described here in situations where only a few sequences are known.
Conclusion: limitations
Inability to predict pseudoknots.
Loop and stem lengths are assumed to be geometrically distributed, owing to the nature of the specific SCFG used here.
A good alignment is needed, which is hard to obtain.
The dynamic programming algorithms are relatively slow: they have a time complexity of O(N^3) with respect to the length of the alignment.
Conclusion: possible improvements
1. Profile SCFGs and covariance models predict secondary structure at the same time as making alignments
2. Modeling base stacking
3. The evolutionary model
4. Reduce the number of parameters for the rate matrix