Presentation is loading. Please wait.

Presentation is loading. Please wait.

CISC667, F05, Lec19, Liao1 CISC 467/667 Intro to Bioinformatics (Fall 2005) RNA secondary structure.

Similar presentations


Presentation on theme: "CISC667, F05, Lec19, Liao1 CISC 467/667 Intro to Bioinformatics (Fall 2005) RNA secondary structure."— Presentation transcript:

1 CISC667, F05, Lec19, Liao1 CISC 467/667 Intro to Bioinformatics (Fall 2005) RNA secondary structure

2 CISC667, F05, Lec19, Liao2 Facts about RNAs Mainly as “information carrier” in protein synthesis –mRNA –tRNA Also as catalysts –Ribozymes Also as regulator –RNAi Single strand –Intra-molecular pairing –Secondary structure –Essential for sequence stability and function Similarity in secondary structure plays a more important role than primary sequence in determining homology for RNAs

3 CISC667, F05, Lec19, Liao3

4 4

5 5

6 6 RNA stem loop hybridization pairing: A-U, and C-G where “-” stands for a pairing, and “x” for no pairing. In the above example, seq1 and seq2 fold into a similar structure, whereas seq3 does not. Pairwise alignments disregarding such structural restrictions may be misleading; seq2 and seq3 have 70% sequence identity, seq1 and seq3 have 60%, whereas seq1 and seq2 have only 30%. GC AU CG GA AA UA CG GC GA CA UC CU GG GA CA x x x seq1seq2 seq3 seq1 seq2 seq3 C A G G A A A C U G G C U G C A A A G C G C U G C A A C U G

7 CISC667, F05, Lec19, Liao7 Representations for palindromic structures 1) 2) 3) GC AU CG GA AA C A G G A A A C U G ( ( (.... ) ) ) There is a 1-1 correspondence between RNA secondary structures and well-balanced parenthesis expressions, where the balancing parentheses correspond to base pairings via hydrogen bonds.

8 CISC667, F05, Lec19, Liao8 Secondary structure prediction Base pair maximization algorithm [Nussinov] m i,j : maximal # of base pairs that can be formed for sequence s i …s j. d i,j = 1, if s i and s j are paired = 0, otherwise Initialization m i,i-1 = 0 for i= [2, L] m i,i = 0 for i = [1,L] Recursion m i,j = max {m i+1,j, m i,j-1, m i+1,j-1 + d i,j, max i<k<j {m i,k + m k+1,j } // bifurcation }

9 CISC667, F05, Lec19, Liao9 0123 0123 0122 0111 00111 0111 0 0 0 GGGAAAUCC 123456789 G G G A A A U C C 1 2 3 4 5 6 7 8 9 CG CG UA AA G Dynamic programming table Time Complexity: O(L 3 ), where L is sequence length Space complexity: O(L 2 )

10 CISC667, F05, Lec19, Liao10 Secondary structure prediction Minimum energry algorithm [Zuker] E i,j : minimum energy for an optimal fold formed by sequence s i …s j. e i,j = -5, if s i, s j is CG or GC = -4, if s i, s j is AU or UA = -1, if s i, s j is GU or UG Initialization m i,i-1 = 0 for i= [2, L] m i,i = 0 for i = [1,L] Recursion E i,j = min {E i+1,j, E i,j-1, E i+1,j-1 + e i,j, min i<k<j {E i,k + E k+1,j } // bifurcation }

11 CISC667, F05, Lec19, Liao11 Motivation: Both DNA and protein sequences can be considered as sentences in a “language”. We know the “language” is not random What’s the grammar of the language? “The linguistics of DNA” – David Searls, American Scientist, 80(19920579.

12 CISC667, F05, Lec19, Liao12 Chomsky hierarchy of transformational grammars (1956) Regular expression (no recursive structure, cannot handle palindromes) –W  aW, or W  a Context-free –W   Context-sensitive –  1 W  2   1   2 Unrestricted grammars –  1 W  2   Notations: a: any terminals; ,  : any string of non-terminals and/or terminals, including the null string;  : any string of non-terminals and/or terminals, not including the null string. Note: Chomsky normal form: any CFG can be written such that all productions have the following forms: W v  W y W z W v  a

13 CISC667, F05, Lec19, Liao13 To capture the palindromic structure, a context-free grammar can be given as follows. S  aW 1 u | cW 1 g | gW 1 c | uW 1 a W 1  aW 2 u| cW 2 g| gW 2 c| uW 2 a W 2  aW 3 u| cW 3 g| gW 3 c| uW 3 a W 3  gaaa | gcaa seq1 seq2 C A G G A A A C U G G C U G C A A A G C c a g g a a a c u g w3w3 w2w2 w1w1 S

14 CISC667, F05, Lec19, Liao14 Stochastic grammars Motivations: –Irregularities (or exceptions to the grammar) in languages –Such irregularities can be taken care of by “growing” your grammar; introducing new non-terminals and new production rules. –It is useful to differentiate productions that account for a large portion of the language from those that are rare exceptions. –This can be achieved by assigning probabilities to various productions. For any non-terminals, the probabilities of all its possible productions must add to 1. Given a sentence x, a stochastic grammar  parsing the sentence will assign a probability P(x|  ) that the sentence belongs to the language specified by the grammar, whereas, a conventional grammar will just give a yes-or-no answer. –  x  language P(x|  ) =1

15 CISC667, F05, Lec19, Liao15 Stochastic CFG for sequence modeling Calculate optimal alignment of a sequence to a parameterized SCFG. [Cocke-Younger-Kasami algorithm] Calculate probability for a given sequence to belong to the language specified by SCFG. [inside algorithm, and inside-outside algorithm] Given a set of example sequences/structures, estimate optimal probability parameters for a SCFG. Note: –The issues addressed above by SCFGs are quite similar to those for HMMs; actually it can be proved that HMMs are equivalent to stochastic regular grammars.

16 CISC667, F05, Lec19, Liao16 Resources Lecture notes: http://www.bioinfo.rpi.edu/~zukerm/lectures/RNAfold-html/ RNA informatics: http://www-lbit.iro.umontreal.ca/RNA_Links/RNA.shtml Software: -Vienna RNA fold


Download ppt "CISC667, F05, Lec19, Liao1 CISC 467/667 Intro to Bioinformatics (Fall 2005) RNA secondary structure."

Similar presentations


Ads by Google