Presentation on theme: "B. Knudsen and J. Hein Department of Genetics and Ecology"— Presentation transcript:
1 RNA Secondary Structure Prediction Using Stochastic Context-Free Grammars and Evolutionary History
B. Knudsen and J. Hein, Department of Genetics and Ecology, The Institute of Biological Sciences, University of Aarhus, Denmark
Presented by Jing Cui, Nov. 22, 2002
2 Outline of the Lecture
Introduction
Algorithms
  The grammar
  Probabilities of columns
  Probabilities of an alignment
  The full model
Implementation
  The database
  Frequencies
  Mutation rates
  Grammar parameters
Results
  The test sequences
  Using related sequences
  Neglecting phylogeny
  Weight of results
  Comparison with other methods
Conclusion
  The limitations
  The improvements
3 Introduction
Single sequence, e.g. Zuker (1989): uses prior information on RNA structures through energy functions; not ideal when estimating structures of sequences with known homologs.
Multiple sequences:
  Covariance methods (Eddy and Durbin, 1994) and profile stochastic context-free grammars (SCFGs; Sakakibara et al., 1994). Characteristics: they do not explicitly take phylogeny into account, and they do not use a prior probability distribution of structures.
  Maximum weighted matching methods (Cary and Stormo, 1995; Tabaska et al., 1998) share the above characteristics.
4 The method used in this paper
Uses prior knowledge about RNA structure to make a maximum a posteriori (MAP) estimate of the secondary structure.
Is performed on an alignment of sequences assumed to have identical secondary structure, i.e. the alignment is assumed to be a structural alignment.
Takes the phylogenetic tree of the sequences into account, including branch lengths, using a model of mutation processes in RNA. The tree can be estimated by a maximum likelihood (ML) method.
Origin: Goldman et al. (1996), who predicted protein secondary structure using HMMs that include phylogenetic information.
Difference: secondary structure interactions in RNA are not local, as they are in proteins, so SCFGs are used here instead of HMMs.
Limitation: SCFGs are unable to model crossing interactions, so pseudoknots cannot be predicted.
5 Algorithm
Input: an alignment of RNA sequences.
Output: a single common structure for the sequences.
The model: the SCFG and the evolutionary model.
6 The grammar: a set of symbols, some terminal and some non-terminal
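The production rules themselves did not survive in this transcript. As a rough illustration, the sketch below assumes the three-nonterminal grammar published in Knudsen and Hein (1999), S → LS | L, F → dFd | LS, L → s | dFd, where s is an unpaired column and d...d is a pair of columns. The rule probabilities are illustrative placeholders, not the trained values, and the function names are hypothetical. Because this grammar is unambiguous, each dot-bracket structure has one derivation, so its prior probability is the product of the probabilities of the rules used.

```python
import math

# Assumed grammar (Knudsen and Hein, 1999); probabilities are placeholders.
RULE_PROBS = {
    ("S", "LS"): 0.87, ("S", "L"): 0.13,
    ("F", "dFd"): 0.79, ("F", "LS"): 0.21,
    ("L", "s"): 0.89, ("L", "dFd"): 0.11,
}

def pair_table(structure):
    """Map each '(' index to the index of its matching ')'."""
    stack, partner = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            partner[stack.pop()] = i
    if stack:
        raise ValueError("unbalanced structure")
    return partner

def derivation(structure):
    """Return the list of rules that derives a dot-bracket structure."""
    partner = pair_table(structure)
    rules = []

    def parse_L(i):
        # L covers either one unpaired position or one enclosing pair.
        if i in partner:
            rules.append(("L", "dFd"))
            parse_F(i + 1, partner[i])
            return partner[i] + 1
        rules.append(("L", "s"))
        return i + 1

    def parse_S(i, j):
        # S covers the half-open region [i, j) as a sequence of L elements.
        if i >= j:
            raise ValueError("structure not derivable: empty S region")
        while True:
            end = partner[i] + 1 if i in partner else i + 1
            rules.append(("S", "L") if end == j else ("S", "LS"))
            parse_L(i)
            if end == j:
                return
            i = end

    def parse_F(i, j):
        # F is the inside of a pair: either another pair or a loop of >= 2 elements.
        if i >= j:
            raise ValueError("structure not derivable: empty F region")
        if partner.get(i) == j - 1:
            rules.append(("F", "dFd"))
            parse_F(i + 1, j - 1)
        else:
            rules.append(("F", "LS"))
            parse_S(parse_L(i), j)

    parse_S(0, len(structure))
    return rules

def log_prior(structure):
    """Log of the structure's prior probability under the assumed grammar."""
    return sum(math.log(RULE_PROBS[rule]) for rule in derivation(structure))
```

Under this grammar a structure such as "(.)" has no derivation, because F → LS needs at least two loop elements; the sketch raises ValueError in that case.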
8 Probabilities of columns
Given the tree, a column of non-pairing bases is assumed to be independent of the other columns, and two pairing columns are assumed to be independent of any other columns.
P = (pA, pU, pG, pC): the distribution of bases in loop regions of RNA sequences.
The rate matrix: mutations are assumed to be reversible.
For base pairs, a 16 by 16 rate matrix is used.
Given a tree, including branch lengths, the column probabilities are calculated using post-order traversal as described by Felsenstein (1981).
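The post-order calculation for one unpaired column can be sketched as follows. This is a minimal illustration, not the authors' implementation: it swaps in the Jukes-Cantor model, whose transition probabilities have a closed form, for the estimated rate matrix, and it uses a hypothetical tree encoding in which a leaf is a base letter and an internal node is a list of (child, branch_length) tuples.

```python
import math

BASES = "AUGC"

def jc_transition(t):
    """Jukes-Cantor P(t); a stand-in for the estimated 4 by 4 rate matrix."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def partial_likelihoods(node):
    """Post-order pass: entry x is P(leaves below node | node has base x)."""
    if isinstance(node, str):  # leaf: the observed base is certain
        return [1.0 if base == node else 0.0 for base in BASES]
    result = [1.0] * 4
    for child, branch_length in node:
        child_l = partial_likelihoods(child)
        p = jc_transition(branch_length)
        for x in range(4):
            result[x] *= sum(p[x][y] * child_l[y] for y in range(4))
    return result

def column_probability(tree, freqs=(0.25, 0.25, 0.25, 0.25)):
    """Probability of one column: root likelihoods weighted by base frequencies."""
    return sum(f * l for f, l in zip(freqs, partial_likelihoods(tree)))

# Example: a column with bases A, A, G on a three-leaf tree.
example_tree = [([("A", 0.1), ("A", 0.1)], 0.2), ("G", 0.4)]
example_probability = column_probability(example_tree)
```

For a pair of columns the same recursion applies with 16 states (one per ordered base pair) instead of 4.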
9 Probability of an alignment
The input data: D = (C1, C2, …, Cl)
The model: M
The tree: T
Secondary structure: σ
s: a single base
d: the left column of a pair
dc: the right column of the pair
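Under the independence assumptions stated for columns, the probability of the alignment given a structure factorizes into one term per unpaired column and one term per pair of columns. The sketch below illustrates that product in log space. The dot-bracket encoding of σ and the callables single_prob and pair_prob are assumptions made for the example; in the method they would be the tree-based column probabilities.

```python
import math

def pair_table(structure):
    """Map each '(' index to the index of its matching ')'."""
    stack, pairs = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs[stack.pop()] = i
    return pairs

def log_alignment_probability(columns, structure, single_prob, pair_prob):
    """Sum of log column probabilities for the alignment, given the structure."""
    pairs = pair_table(structure)
    right_columns = set(pairs.values())
    total = 0.0
    for i, column in enumerate(columns):
        if i in pairs:
            # Left column d and its partner dc are scored jointly.
            total += math.log(pair_prob(column, columns[pairs[i]]))
        elif i not in right_columns:
            # Unpaired column s is scored on its own.
            total += math.log(single_prob(column))
    return total
```

Each right-hand column is skipped in the loop because it has already been counted together with its left partner.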
10 The core model
The SCFG combined with the evolutionary model.
The grammar is equivalent to a grammar that generates the columns of an alignment instead of just the secondary structure, meaning that for a two-sequence alignment, the production rule covers the following rules:
11 The full model
The ML estimate of the tree, given the model (used if no phylogenetic tree is available).
MAP (maximum a posteriori) estimation of the most likely secondary structure, by Bayes' theorem, where P(σ|T,M) is the prior distribution of structures given by the SCFG.
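The equations on this slide are missing from the transcript. A plausible reconstruction from the symbols defined on the earlier slides (D, T, M, σ), assuming the standard form of Bayes' theorem, is:

```latex
T^{ML} = \arg\max_{T} P(D \mid T, M)

\sigma^{MAP} = \arg\max_{\sigma} P(\sigma \mid D, T, M)
             = \arg\max_{\sigma} \frac{P(D \mid \sigma, T, M)\, P(\sigma \mid T, M)}{P(D \mid T, M)}
             = \arg\max_{\sigma} P(D \mid \sigma, T, M)\, P(\sigma \mid T, M)
```

The denominator P(D|T,M) does not depend on σ, which is why it can be dropped from the maximization.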
12 Implementation: the database
The database used for estimating this model should represent RNA structure in general.
The database should be composed of various types of RNA:
  tRNA database by Sprinzl et al. (1998)
  ribosomal RNAs (LSU rRNAs) by De Rijk et al. (1998)
13 Frequencies
The single-base frequencies were estimated from counts of the bases in the single-base positions of the sequences.
Base-pair frequencies were estimated by counting base pairs.
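The counting described above can be sketched as follows, assuming each database entry comes as a sequence with a dot-bracket structure annotation (an assumption about the data format; the function name is hypothetical).

```python
from collections import Counter

def estimate_frequencies(sequences, structures):
    """Relative frequencies of unpaired bases and of (left, right) base pairs."""
    single, paired = Counter(), Counter()
    for seq, struct in zip(sequences, structures):
        stack = []
        for i, ch in enumerate(struct):
            if ch == ".":
                single[seq[i]] += 1
            elif ch == "(":
                stack.append(i)
            elif ch == ")":
                # Pair the base at the matching '(' with the base at this ')'.
                paired[(seq[stack.pop()], seq[i])] += 1
    n_single = sum(single.values())
    n_pairs = sum(paired.values())
    single_freq = {base: count / n_single for base, count in single.items()}
    pair_freq = {pair: count / n_pairs for pair, count in paired.items()}
    return single_freq, pair_freq
```

The sketch assumes the input contains at least one unpaired position and at least one pair; otherwise the normalization would divide by zero.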
14 Mutation rates
For a given pair of sequences, p:
  tp: the time between the sequences
  Np: the number of columns in the two-sequence alignment
  Ps: the probability of a base being in a single-base position
17 Grammar parameters
Estimated by the inside-outside algorithm (an expectation maximization procedure) on the training set of secondary structures (Baker, 1979; Lari and Young, 1990).
This is analogous to the forward-backward algorithm for HMMs.
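To make the HMM analogy concrete, the sketch below shows only the inside pass, which plays the role of the forward pass: it sums the probability of the data over all structures. It assumes the grammar S → LS | L, F → dFd | LS, L → s | dFd from Knudsen and Hein (1999); the probs keys and the single and pair callables are hypothetical names chosen for the example. The three nested loops over span length, start position and split point give the cubic running time.

```python
def inside_probability(n, single, pair, probs):
    """Probability of columns 0..n-1 summed over all structures.

    single(i) and pair(i, j) are caller-supplied column probabilities.
    Tables are indexed by half-open spans [i, j).
    """
    S = [[0.0] * (n + 1) for _ in range(n + 1)]
    L = [[0.0] * (n + 1) for _ in range(n + 1)]
    F = [[0.0] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            # Shared right-hand side "L S": sum over every split point k.
            split = sum(L[i][k] * S[k][j] for k in range(i + 1, j))
            # Columns i and j-1 paired around an inner F region.
            enclosed = pair(i, j - 1) * F[i + 1][j - 1] if length >= 3 else 0.0
            F[i][j] = probs["F->dFd"] * enclosed + probs["F->LS"] * split
            L[i][j] = probs["L->dFd"] * enclosed
            if length == 1:
                L[i][j] += probs["L->s"] * single(i)
            S[i][j] = probs["S->LS"] * split + probs["S->L"] * L[i][j]
    return S[0][n]

# Placeholder probabilities for illustration only, not trained values.
PLACEHOLDER_PROBS = {
    "S->LS": 0.87, "S->L": 0.13,
    "F->dFd": 0.79, "F->LS": 0.21,
    "L->s": 0.89, "L->dFd": 0.11,
}
```

With single and pair both returning 1, the result is the total prior probability the grammar assigns to all structures of length n. A full inside-outside implementation would add the outside pass and the re-estimation step, which are not shown here.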
18 Results: the test sequences
4 bacterial RNase P RNA sequences
Alignment: 385 columns
Pairwise sequence identities: 65-92%
19 Pseudoknots: 68-76 with 368-361; 18-12 with 370-364
At least 22 positions are wrongly predicted in each sequence.
22 Weight of results
Using the inside and outside variables, calculate the probability that each position is correctly predicted.
This indicates how certain the predictions are, assuming that the model is correct.
23 Comparison with other methods
The energy minimization method has more parameters and gives better results.
COVE (Eddy and Durbin, 1994) gives lower accuracy.
This shows the significance of the method described here in situations where only a few sequences are known.
24 Conclusion: limitations
Inability to predict pseudoknots.
Loop and stem lengths are assumed to be geometrically distributed, which follows from the nature of the specific SCFG used here.
A good alignment is needed, which is a hard problem to solve.
The dynamic programming algorithms are relatively slow: they have a time complexity of O(N^3) with respect to the length of the alignment.
25 Conclusion: possible improvements
Profile SCFGs and covariance models predict secondary structure at the same time as making alignments.
Modeling base stacking.
The evolutionary model: reduce the number of parameters for the rate matrix.