1 CS262 Lecture 12, Win06, Batzoglou RNA Secondary Structure
aagacuucggaucuggcgacaccc
uacacuucggaugacaccaaagug
aggucuucggcacgggcaccauuc
ccaacuucggauuuugcuaccaua
aagccuucggagcgggcguaacuc

2 CS262 Lecture 12, Win06, Batzoglou Decoding: the CYK algorithm
Given x = x_1…x_N and a SCFG G, find the most likely parse of x (the most likely alignment of G to x).
Dynamic programming variable:
γ(i, j, V): likelihood of the most likely parse of x_i…x_j, rooted at nonterminal V
Then, γ(1, N, S): likelihood of the most likely parse of x by the grammar.

3 CS262 Lecture 12, Win06, Batzoglou The CYK algorithm (Cocke-Younger-Kasami)
Initialization:
For i = 1 to N, for any nonterminal V, γ(i, i, V) = log P(V → x_i)
Iteration:
For i = 1 to N-1
 For j = i+1 to N
  For any nonterminal V,
   γ(i, j, V) = max_X max_Y max_{i ≤ k < j} γ(i, k, X) + γ(k+1, j, Y) + log P(V → XY)
Termination:
log P(x | θ, π*) = γ(1, N, S)
where π* is the optimal parse tree (recovered by tracing back the maximizing choices above).
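
To make the recursion concrete, here is a minimal Python sketch of CYK for a SCFG in Chomsky normal form. The parameter layout (the emit_p and rule_p dictionaries) and the function name are illustrative assumptions, not from the lecture; the sketch returns only the best log-likelihood and omits the traceback that recovers π*.

```python
import math
from collections import defaultdict

def cyk_scfg(x, emit_p, rule_p, start="S"):
    """Best-parse log-likelihood gamma(1, N, start) for a SCFG in Chomsky normal form.

    emit_p[V][a]      -- probability of the rule V -> a (terminal emission)
    rule_p[V][(X, Y)] -- probability of the rule V -> X Y
    """
    N = len(x)
    nonterms = set(emit_p) | set(rule_p)
    NEG = float("-inf")
    # gamma[(i, j)][V]: best log-prob of a parse of x[i..j] rooted at V (0-based, inclusive)
    gamma = defaultdict(lambda: defaultdict(lambda: NEG))

    # Initialization: single-symbol spans
    for i, a in enumerate(x):
        for V in nonterms:
            p = emit_p.get(V, {}).get(a, 0.0)
            if p > 0:
                gamma[(i, i)][V] = math.log(p)

    # Iteration: fill spans in order of increasing length
    for length in range(2, N + 1):
        for i in range(N - length + 1):
            j = i + length - 1
            for V in nonterms:
                best = NEG
                for (X, Y), p in rule_p.get(V, {}).items():
                    for k in range(i, j):
                        cand = gamma[(i, k)][X] + gamma[(k + 1, j)][Y] + math.log(p)
                        best = max(best, cand)
                gamma[(i, j)][V] = best

    return gamma[(0, N - 1)][start]
```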

4 CS262 Lecture 12, Win06, Batzoglou A SCFG for predicting RNA structure
S → a S | c S | g S | u S | ε
S → S a | S c | S g | S u
S → a S u | c S g | g S u | u S g | g S c | u S a
S → S S
Adjust the probability parameters to reflect bond strength, etc.
No distinction between non-paired bases, bulges, loops; can modify the grammar to model these events:
 L: loop nonterminal
 H: hairpin nonterminal
 B: bulge nonterminal
 etc.

5 CS262 Lecture 12, Win06, Batzoglou CYK for RNA folding
Initialization:
γ(i, i-1) = log P(ε)
Iteration (filling γ in order of increasing span length j - i):
For i = 1 to N
 For j = i to N
  γ(i, j) = max of:
   γ(i+1, j-1) + log P(x_i S x_j)
   γ(i, j-1) + log P(S x_j)
   γ(i+1, j) + log P(x_i S)
   max_{i < k < j} γ(i, k) + γ(k+1, j) + log P(S S)
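
A minimal Python sketch of this fold-scoring recursion for the single-nonterminal grammar on the previous slide. The parameter dictionaries (p_pair, p_left, p_right) and their names are illustrative assumptions; only the score is computed, not the fold itself.

```python
import math

def rna_cyk(x, p_pair, p_left, p_right, p_eps, p_bif):
    """Log-likelihood of the best fold of RNA string x under S -> aSb | aS | Sa | SS | eps.

    p_pair[(a, b)] -- probability of S -> a S b
    p_left[a]      -- probability of S -> a S
    p_right[a]     -- probability of S -> S a
    p_eps, p_bif   -- probabilities of S -> eps and S -> S S
    """
    N = len(x)
    # gamma[i][j]: best log-prob for x_i..x_j (1-based); gamma[i][i-1] handles empty spans
    gamma = [[0.0] * (N + 1) for _ in range(N + 2)]
    for i in range(1, N + 2):
        gamma[i][i - 1] = math.log(p_eps)                 # empty subsequence: S -> eps

    tiny = 1e-12                                          # fallback to avoid log(0)
    for length in range(1, N + 1):                        # fill by increasing span length
        for i in range(1, N - length + 2):
            j = i + length - 1
            a, b = x[i - 1], x[j - 1]
            best = gamma[i][j - 1] + math.log(p_right.get(b, tiny))        # S -> S x_j
            best = max(best, gamma[i + 1][j] + math.log(p_left.get(a, tiny)))  # S -> x_i S
            if length >= 2:                               # S -> x_i S x_j (base pair)
                best = max(best, gamma[i + 1][j - 1] + math.log(p_pair.get((a, b), tiny)))
            for k in range(i, j):                         # S -> S S (bifurcation)
                best = max(best, gamma[i][k] + gamma[k + 1][j] + math.log(p_bif))
            gamma[i][j] = best
    return gamma[1][N]
```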

6 CS262 Lecture 12, Win06, Batzoglou Evaluation
Recall HMMs:
Forward: f_l(i) = P(x_1…x_i, π_i = l)
Backward: b_k(i) = P(x_{i+1}…x_N | π_i = k)
Then, P(x) = Σ_k f_k(N) a_k0 = Σ_l a_0l e_l(x_1) b_l(1)
Analogue in SCFGs:
Inside: a(i, j, V) = P(x_i…x_j is generated by nonterminal V)
Outside: b(i, j, V) = P(x, excluding x_i…x_j, is generated by S and the excluded part is rooted at V)

7 CS262 Lecture 12, Win06, Batzoglou The Inside Algorithm
To compute a(i, j, V) = P(x_i…x_j, produced by V):
a(i, j, V) = Σ_X Σ_Y Σ_k a(i, k, X) a(k+1, j, Y) P(V → XY)

8 CS262 Lecture 12, Win06, Batzoglou Algorithm: Inside
Initialization:
For i = 1 to N, V a nonterminal, a(i, i, V) = P(V → x_i)
Iteration:
For i = 1 to N-1
 For j = i+1 to N
  For V a nonterminal,
   a(i, j, V) = Σ_X Σ_Y Σ_{i ≤ k < j} a(i, k, X) a(k+1, j, Y) P(V → XY)
Termination:
P(x | θ) = a(1, N, S)
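
The same span recursion in probability space gives the inside algorithm. Below is a small Python sketch for a CNF grammar, reusing the hypothetical emit_p / rule_p layout from the CYK sketch above; it returns the full inside table together with P(x | θ).

```python
from collections import defaultdict

def inside(x, emit_p, rule_p, start="S"):
    """Inside algorithm for a SCFG in Chomsky normal form.

    Returns (a, p_x) where a[(i, j)][V] = P(x_i..x_j generated by V), spans 1-based inclusive,
    and p_x = P(x | theta) = a(1, N, start).
    """
    N = len(x)
    nonterms = set(emit_p) | set(rule_p)
    a = defaultdict(lambda: defaultdict(float))

    # Initialization: a(i, i, V) = P(V -> x_i)
    for i in range(1, N + 1):
        for V in nonterms:
            a[(i, i)][V] = emit_p.get(V, {}).get(x[i - 1], 0.0)

    # Iteration: spans of increasing length
    for length in range(2, N + 1):
        for i in range(1, N - length + 2):
            j = i + length - 1
            for V in nonterms:
                total = 0.0
                for (X, Y), p in rule_p.get(V, {}).items():
                    for k in range(i, j):
                        total += a[(i, k)][X] * a[(k + 1, j)][Y] * p
                a[(i, j)][V] = total

    return a, a[(1, N)][start]
```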

9 CS262 Lecture 12, Win06, Batzoglou The Outside Algorithm
b(i, j, V) = P(x_1…x_{i-1}, x_{j+1}…x_N, where the "gap" x_i…x_j is rooted at V)
In the case where V is the right-hand-side nonterminal of a production Y → XV:
b(i, j, V) = Σ_X Σ_Y Σ_{k<i} a(k, i-1, X) b(k, j, Y) P(Y → XV)

10 CS262 Lecture 12, Win06, Batzoglou Algorithm: Outside
Initialization:
b(1, N, S) = 1
For any other V, b(1, N, V) = 0
Iteration:
For i = 1 to N-1
 For j = N down to i
  For V a nonterminal,
   b(i, j, V) = Σ_X Σ_Y Σ_{k<i} a(k, i-1, X) b(k, j, Y) P(Y → XV)
             + Σ_X Σ_Y Σ_{k>j} a(j+1, k, X) b(i, k, Y) P(Y → VX)
Termination:
It is true for any i that P(x | θ) = Σ_X b(i, i, X) P(X → x_i)

11 CS262 Lecture 12, Win06, Batzoglou Learning for SCFGs
We can now estimate c(V) = expected number of times V is used in the parse of x_1…x_N:
c(V) = (1 / P(x | θ)) Σ_{1 ≤ i ≤ N} Σ_{i ≤ j ≤ N} a(i, j, V) b(i, j, V)
c(V → XY) = (1 / P(x | θ)) Σ_{1 ≤ i ≤ N} Σ_{i < j ≤ N} Σ_{i ≤ k < j} b(i, j, V) a(i, k, X) a(k+1, j, Y) P(V → XY)

12 CS262 Lecture 12, Win06, Batzoglou Learning for SCFGs
Then, we can re-estimate the parameters with EM, by:
P_new(V → XY) = c(V → XY) / c(V)
P_new(V → a) = c(V → a) / c(V)
            = Σ_{i: x_i = a} b(i, i, V) P(V → a) / Σ_{1 ≤ i ≤ N} Σ_{i ≤ j ≤ N} a(i, j, V) b(i, j, V)
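
A sketch of one EM re-estimation pass in Python, assuming the inside table a (laid out as in the inside sketch above) and an analogous outside table b have already been computed for a single training sequence. The table layouts, argument names, and single-sequence setup are illustrative simplifications of the update formulas on this slide.

```python
def em_reestimate(x, a, b, emit_p, rule_p, p_x):
    """One EM update for a CNF SCFG from inside table a, outside table b, and P(x | theta) = p_x."""
    N = len(x)
    new_emit, new_rule = {}, {}
    for V in set(emit_p) | set(rule_p):
        # c(V): expected number of times V is used
        c_V = sum(a[(i, j)][V] * b[(i, j)][V]
                  for i in range(1, N + 1) for j in range(i, N + 1)) / p_x
        if c_V == 0:
            continue
        # c(V -> X Y): expected number of uses of each binary rule
        new_rule[V] = {}
        for (X, Y), p in rule_p.get(V, {}).items():
            c_rule = sum(b[(i, j)][V] * a[(i, k)][X] * a[(k + 1, j)][Y] * p
                         for i in range(1, N + 1)
                         for j in range(i + 1, N + 1)
                         for k in range(i, j)) / p_x
            new_rule[V][(X, Y)] = c_rule / c_V
        # c(V -> a): expected number of emissions of each terminal
        new_emit[V] = {}
        for sym, p in emit_p.get(V, {}).items():
            c_emit = sum(b[(i, i)][V] * p
                         for i in range(1, N + 1) if x[i - 1] == sym) / p_x
            new_emit[V][sym] = c_emit / c_V
    return new_emit, new_rule
```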

13 CS262 Lecture 12, Win06, Batzoglou Summary: SCFG and HMM algorithms
GOAL                 HMM algorithm    SCFG algorithm
Optimal parse        Viterbi          CYK
Estimation           Forward          Inside
                     Backward         Outside
Learning             EM: Fw/Bck       EM: Ins/Outs
Memory complexity    O(N K)           O(N^2 K)
Time complexity      O(N K^2)         O(N^3 K^3)
where K = # of states in the HMM, or # of nonterminals in the SCFG

14 CS262 Lecture 12, Win06, Batzoglou The Zuker algorithm - main ideas
Models energy of a fold in terms of specific features:
1. Pairs of base pairs (stacked pairs)
2. Bulges
3. Loops (size, composition)
4. Interactions between stem and loop
[Figures: a stacked pair at positions i and j; a bulge/loop of length l between positions i and j; a stem-loop interaction involving positions i, j, j'.]

15 Phylogeny Tree Reconstruction
[Figures: two example phylogenetic trees on leaves 1 to 5.]

16 CS262 Lecture 12, Win06, Batzoglou Inferring Phylogenies
Trees can be inferred by several criteria:
 Morphology of the organisms (can lead to mistakes!)
 Sequence comparison
Example:
Orc:    ACAGTGACGCCCCAAACGT
Elf:    ACAGTGACGCTACAAACGT
Dwarf:  CCTGTGACGTAACAAACGA
Hobbit: CCTGTGACGTAGCAAACGA
Human:  CCTGTGACGTAGCAAACGA

17 CS262 Lecture 12, Win06, Batzoglou Modeling Evolution
During infinitesimal time Δt, there is not enough time for two substitutions to happen on the same nucleotide.
So we can estimate P(x | y, Δt), for x, y ∈ {A, C, G, T}.
Then let
          | P(A|A, Δt)  …  P(A|T, Δt) |
S(Δt) =   |     …       …      …      |
          | P(T|A, Δt)  …  P(T|T, Δt) |

18 CS262 Lecture 12, Win06, Batzoglou Modeling Evolution
Reasonable assumption: multiplicative (implying a stationary Markov process):
S(t+t') = S(t) S(t')
That is, P(x | y, t+t') = Σ_z P(x | z, t) P(z | y, t')
Jukes-Cantor: constant rate of evolution. For short time ε, with rows/columns ordered A, C, G, T:
               | 1-3αε    αε      αε      αε    |
S(ε) = I + Rε = |  αε     1-3αε    αε      αε    |
               |  αε      αε     1-3αε    αε    |
               |  αε      αε      αε     1-3αε  |

19 CS262 Lecture 12, Win06, Batzoglou Modeling Evolution
Jukes-Cantor, for longer times:
       | r(t) s(t) s(t) s(t) |
S(t) = | s(t) r(t) s(t) s(t) |
       | s(t) s(t) r(t) s(t) |
       | s(t) s(t) s(t) r(t) |
where we can derive:
r(t) = ¼ (1 + 3 e^(-4αt))
s(t) = ¼ (1 - e^(-4αt))
Derivation: S(t+ε) = S(t) S(ε) = S(t)(I + Rε), therefore (S(t+ε) - S(t)) / ε = S(t) R.
In the limit ε → 0, S'(t) = S(t) R.
Equivalently, r' = -3αr + 3αs and s' = -αs + αr.
These differential equations lead to:
r(t) = ¼ (1 + 3 e^(-4αt))
s(t) = ¼ (1 - e^(-4αt))
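
These closed forms are easy to check numerically. Here is a small Python sketch that builds S(t) from r(t) and s(t); the function and argument names are my own, not from the lecture.

```python
import math

def jukes_cantor_matrix(t, alpha):
    """4x4 Jukes-Cantor substitution matrix S(t) for rate alpha, rows/columns ordered A, C, G, T."""
    r = 0.25 * (1 + 3 * math.exp(-4 * alpha * t))   # probability of no net change
    s = 0.25 * (1 - math.exp(-4 * alpha * t))       # probability of change to a specific other base
    return [[r if i == j else s for j in range(4)] for i in range(4)]

# Sanity check: each row sums to 1 (and S(t) tends to the uniform 1/4 matrix as t grows)
S = jukes_cantor_matrix(t=1.0, alpha=0.3)
assert all(abs(sum(row) - 1.0) < 1e-9 for row in S)
```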

20 CS262 Lecture 12, Win06, Batzoglou Modeling Evolution
Kimura:
Transitions: A/G, C/T
Transversions: A/T, A/C, G/T, C/G
Transitions (rate α) are much more likely than transversions (rate β).
       | r(t) s(t) u(t) s(t) |
S(t) = | s(t) r(t) s(t) u(t) |
       | u(t) s(t) r(t) s(t) |
       | s(t) u(t) s(t) r(t) |
where
s(t) = ¼ (1 - e^(-4βt))
u(t) = ¼ (1 + e^(-4βt) - 2 e^(-2(α+β)t))
r(t) = 1 - 2s(t) - u(t)

21 CS262 Lecture 12, Win06, Batzoglou Phylogeny and sequence comparison
Basic principles:
 Degree of sequence difference is proportional to the length of independent sequence evolution.
 Only use positions where the alignment is pretty certain; avoid areas with (too many) gaps.

22 CS262 Lecture 12, Win06, Batzoglou Distance between two sequences
Given sequences x^i, x^j, define d_ij = distance between the two sequences.
One possible definition: d_ij = fraction f of sites u where x^i[u] ≠ x^j[u].
Better model (Jukes-Cantor):
f = 3 s(t) = ¾ (1 - e^(-4αt))
⇒ ¾ e^(-4αt) = ¾ - f
⇒ -4αt = log(1 - 4f/3)
⇒ d_ij = t = -(1/(4α)) log(1 - 4f/3)
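
A short Python sketch of this corrected distance (names are illustrative). With alpha left at 1 the returned value equals αt, the rate-scaled time; pass the actual substitution rate to get time itself.

```python
import math

def jukes_cantor_distance(seq_i, seq_j, alpha=1.0):
    """Jukes-Cantor corrected distance d = -1/(4 alpha) * ln(1 - 4f/3) between two aligned,
    gap-free sequences, where f is the observed fraction of differing sites."""
    assert len(seq_i) == len(seq_j)
    f = sum(a != b for a, b in zip(seq_i, seq_j)) / len(seq_i)   # fraction of differing sites
    if f >= 0.75:
        return float("inf")        # correction undefined: sequences look saturated
    return -0.25 / alpha * math.log(1 - 4 * f / 3)

# Example with two of the toy sequences from the "Inferring Phylogenies" slide
d = jukes_cantor_distance("CCTGTGACGTAGCAAACGA", "CCTGTGACGTAACAAACGA")
```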

23 CS262 Lecture 12, Win06, Batzoglou A simple clustering method for building trees
UPGMA (unweighted pair group method using arithmetic averages), also called the Average Linkage Method.
Given two disjoint clusters C_i, C_j of sequences,
d_ij = (1 / (|C_i| |C_j|)) Σ_{p ∈ C_i, q ∈ C_j} d_pq
Claim: if C_k = C_i ∪ C_j, then the distance to another cluster C_l is
d_kl = (d_il |C_i| + d_jl |C_j|) / (|C_i| + |C_j|)
Proof:
d_kl = (Σ_{C_i, C_l} d_pq + Σ_{C_j, C_l} d_pq) / ((|C_i| + |C_j|) |C_l|)
     = (|C_i| |C_l| d_il + |C_j| |C_l| d_jl) / ((|C_i| + |C_j|) |C_l|)    (since Σ_{C_i, C_l} d_pq = |C_i| |C_l| d_il)
     = (|C_i| d_il + |C_j| d_jl) / (|C_i| + |C_j|)

24 CS262 Lecture 12, Win06, Batzoglou Algorithm: Average Linkage
Initialization:
 Assign each x_i to its own cluster C_i.
 Define one leaf per sequence, at height 0.
Iteration:
 Find the two clusters C_i, C_j such that d_ij is minimal.
 Let C_k = C_i ∪ C_j.
 Define a node connecting C_i, C_j, and place it at height d_ij / 2.
 Delete C_i, C_j.
Termination:
 When two clusters i, j remain, place the root at height d_ij / 2.
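
Here is a compact Python sketch of this procedure. The data structures (pairwise distances keyed by frozensets, clusters labelled by nested tuples) are illustrative choices of mine, and the loop merges all the way down to a single cluster, which is equivalent to placing the root when two clusters remain.

```python
from itertools import combinations

def upgma(names, dist):
    """Average-linkage (UPGMA) clustering sketch.

    names : list of leaf labels
    dist  : dict mapping frozenset({a, b}) -> distance between leaves a, b (all pairs required)
    Returns (tree, root_height) where tree is a nested tuple of labels.
    """
    trees = {n: n for n in names}                 # cluster label -> subtree
    sizes = {n: 1 for n in names}
    heights = {n: 0.0 for n in names}
    d = dict(dist)

    while len(trees) > 1:
        # pick the pair of current clusters with minimal average-linkage distance
        i, j = min(combinations(trees, 2), key=lambda p: d[frozenset(p)])
        dij = d[frozenset((i, j))]
        k = (i, j)                                # new cluster labelled by its pair
        trees[k] = (trees[i], trees[j])
        sizes[k] = sizes[i] + sizes[j]
        heights[k] = dij / 2                      # new node sits at height d_ij / 2
        # weighted-average update of distances from k to every other cluster
        for l in trees:
            if l in (i, j, k):
                continue
            d[frozenset((k, l))] = (d[frozenset((i, l))] * sizes[i] +
                                    d[frozenset((j, l))] * sizes[j]) / sizes[k]
        for c in (i, j):
            del trees[c], sizes[c]

    root = next(iter(trees))
    return trees[root], heights[root]
```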

25 CS262 Lecture 12, Win06, Batzoglou Example
Initial distances:
       v   w   x   y   z
  v    0   6   8   8   8
  w        0   8   8   8
  x            0   4   4
  y                0   2
  z                    0
Merge {y, z} at height 1:
       v   w   x   yz
  v    0   6   8   8
  w        0   8   8
  x            0   4
  yz               0
Merge {x, yz} at height 2:
       v   w   xyz
  v    0   6   8
  w        0   8
  xyz          0
Merge {v, w} at height 3:
       vw   xyz
  vw   0    8
  xyz       0
The root joins {v, w} and {x, y, z} at height 4.
[Figure: the resulting tree over leaves y, z, x, w, v, with internal nodes at heights 1, 2, 3, 4.]

26 CS262 Lecture 12, Win06, Batzoglou Ultrametric Distances and Molecular Clock
Definition: A distance function d(., .) is ultrametric if for any three distances d_ij ≤ d_ik ≤ d_jk, it is true that d_ik = d_jk (the two largest of the three distances are equal).
The Molecular Clock:
The evolutionary distance between species x and y is 2 × the Earth time to reach the nearest common ancestor.
That is, the molecular clock has constant rate in all species.
[Figure: an ultrametric tree over species 1 to 5, drawn against a time axis in years.]
The molecular clock results in ultrametric distances.

27 CS262 Lecture 12, Win06, Batzoglou Ultrametric Distances & Average Linkage
Average Linkage is guaranteed to reconstruct correctly a binary tree with ultrametric distances.
Proof: Exercise.

28 CS262 Lecture 12, Win06, Batzoglou Weakness of Average Linkage
Molecular clock: all species evolve at the same rate (Earth time).
However, certain species (e.g., mouse, rat) evolve much faster.
Example where UPGMA messes up:
[Figure: a four-leaf example showing the correct tree next to the incorrect Average Linkage tree.]

29 CS262 Lecture 12, Win06, Batzoglou Additive Distances
Given a tree, a distance measure is additive if the distance between any pair of leaves is the sum of the lengths of the edges connecting them.
Given a tree T and additive distances d_ij, we can uniquely reconstruct the edge lengths:
 Find two neighboring leaves i, j, with common parent k.
 Place the parent node k at distance d_km = ½ (d_im + d_jm - d_ij) from any other node m.
[Figure: a tree with nodes labeled 1 to 13, illustrating the path that realizes d_{1,4}.]

30 CS262 Lecture 12, Win06, Batzoglou Additive Distances
For any four leaves x, y, z, w, consider the three sums:
 d(x, y) + d(z, w)
 d(x, z) + d(y, w)
 d(x, w) + d(y, z)
One of them is smaller than the other two, which are equal:
d(x, y) + d(z, w) < d(x, z) + d(y, w) = d(x, w) + d(y, z)
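
This four-point condition is easy to test directly; here is a tiny Python sketch (the function name and tolerance handling are my own).

```python
def four_point_check(d, x, y, z, w, tol=1e-9):
    """True iff, for leaves x, y, z, w, one pairing sum is strictly smallest and the other two are equal.
    d is a dict of pairwise distances; either key order is accepted."""
    dd = lambda a, b: d[(a, b)] if (a, b) in d else d[(b, a)]
    sums = sorted([dd(x, y) + dd(z, w),
                   dd(x, z) + dd(y, w),
                   dd(x, w) + dd(y, z)])
    return sums[0] < sums[1] - tol and abs(sums[1] - sums[2]) < tol

# Example with leaves v, w, x, y of the distance matrix on the next slide:
# d(v,w) + d(x,y) = 19  <  d(v,x) + d(w,y) = d(v,y) + d(w,x) = 31
```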

31 CS262 Lecture 12, Win06, Batzoglou Reconstructing Additive Distances Given T
[Figure: tree T over leaves v, w, x, y, z with edge lengths 5, 4, 7, 3, 3, 4, 6.]
D:
       v    w    x    y    z
  v    0   10   17   16   16
  w         0   15   14   14
  x              0    9   15
  y                   0   14
  z                        0
If we know T and D, but do not know the individual edge lengths, we can reconstruct those lengths.

32 CS262 Lecture 12, Win06, Batzoglou Reconstructing Additive Distances Given T
[Figure: the same tree T, now without edge lengths, together with the distance matrix D from the previous slide.]

33 CS262 Lecture 12, Win06, Batzoglou Reconstructing Additive Distances Given T
Collapse the neighboring leaves v, w into their parent a:
d_ax = ½ (d_vx + d_wx - d_vw) = 11
d_ay = ½ (d_vy + d_wy - d_vw) = 10
d_az = ½ (d_vz + d_wz - d_vw) = 10
D1:
       a    x    y    z
  a    0   11   10   10
  x         0    9   15
  y              0   14
  z                   0

34 CS262 Lecture 12, Win06, Batzoglou Reconstructing Additive Distances Given T
Collapse x, y into their parent b, then b, z into their parent c:
D2:
       a    b    z
  a    0    6   10
  b         0   10
  z              0
D3:
       a    c
  a    0    3
  c         0
Recovered edge lengths:
d(a, c) = 3
d(b, c) = d(a, b) - d(a, c) = 3
d(c, z) = d(a, z) - d(a, c) = 7
d(b, x) = d(a, x) - d(a, b) = 5
d(b, y) = d(a, y) - d(a, b) = 4
d(a, w) = d(z, w) - d(a, z) = 4
d(a, v) = d(z, v) - d(a, z) = 6
These match the original edge lengths 5, 4, 7, 3, 3, 4, 6. Correct!

35 CS262 Lecture 12, Win06, Batzoglou Neighbor-Joining
 Guaranteed to produce the correct tree if the distance is additive.
 May produce a good tree even when the distance is not additive.
Step 1: Finding neighboring leaves
Define D_ij = d_ij - (r_i + r_j), where
r_i = (1 / (|L| - 2)) Σ_k d_ik
Claim: The above "magic trick" ensures that D_ij is minimal iff i, j are neighbors.
Proof: Very technical, please read Durbin et al.!
[Figure: a four-leaf tree with edge lengths 0.1 and 0.4 in which the two closest leaves are not neighbors.]

36 CS262 Lecture 12, Win06, Batzoglou Algorithm: Neighbor-joining
Initialization:
 Define T to be the set of leaf nodes, one per sequence.
 Let L = T.
Iteration:
 Pick i, j such that D_ij is minimal.
 Define a new node k, and set d_km = ½ (d_im + d_jm - d_ij) for all m ∈ L.
 Add k to T, with edges of lengths d_ik = ½ (d_ij + r_i - r_j) and d_jk = d_ij - d_ik.
 Remove i, j from L; add k to L.
Termination:
 When L consists of two nodes i, j, add the edge between them, of length d_ij.
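
A Python sketch of this loop follows. The data-structure choices (distances keyed by frozensets, fresh string labels for internal nodes, an edge list as output) are illustrative, not from the lecture.

```python
from itertools import combinations

def neighbor_joining(names, dist):
    """Neighbor-joining sketch: returns a list of edges (u, v, length)."""
    L = list(names)
    d = dict(dist)                                   # frozenset({a, b}) -> distance
    edges = []
    new_id = 0

    while len(L) > 2:
        # r_i = (1 / (|L| - 2)) * sum_k d_ik
        r = {i: sum(d[frozenset((i, k))] for k in L if k != i) / (len(L) - 2) for i in L}
        # pick the pair minimizing D_ij = d_ij - (r_i + r_j)
        i, j = min(combinations(L, 2),
                   key=lambda p: d[frozenset(p)] - r[p[0]] - r[p[1]])
        dij = d[frozenset((i, j))]
        k = "node%d" % new_id                        # fresh internal-node label
        new_id += 1
        d_ik = 0.5 * (dij + r[i] - r[j])
        edges.append((i, k, d_ik))
        edges.append((j, k, dij - d_ik))             # d_jk = d_ij - d_ik
        # distances from the new node k to every other current node
        for m in L:
            if m not in (i, j):
                d[frozenset((k, m))] = 0.5 * (d[frozenset((i, m))] +
                                              d[frozenset((j, m))] - dij)
        L = [m for m in L if m not in (i, j)] + [k]

    i, j = L
    edges.append((i, j, d[frozenset((i, j))]))
    return edges
```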

37 CS262 Lecture 12, Win06, Batzoglou Parsimony - What if we don't have distances?
One of the most popular methods:
 GIVEN a multiple alignment
 FIND a tree & history of substitutions explaining the alignment
Idea: Find the tree that explains the observed sequences with a minimal number of substitutions.
Two computational subproblems:
1. Find the parsimony cost of a given tree (easy)
2. Search through all tree topologies (hard)

38 CS262 Lecture 12, Win06, Batzoglou Example
[Figure: a small tree whose leaves carry the characters A and B. The internal node above the A and B leaves gets the set {A, B} and cost C += 1; the other nodes get {A}; final cost C = 1.]

39 CS262 Lecture 12, Win06, Batzoglou Parsimony Scoring
Given a tree, and an alignment column u:
Label internal nodes to minimize the number of required substitutions.
Initialization: Set cost C = 0; k = 2N - 1.
Iteration (process each node k so that its daughters are handled first):
 If k is a leaf, set R_k = { x_k[u] }.
 If k is not a leaf:
  Let i, j be the daughter nodes.
  Set R_k = R_i ∩ R_j if the intersection is nonempty.
  Set R_k = R_i ∪ R_j, and C += 1, if the intersection is empty.
Termination: Minimal cost of the tree for column u = C.
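
Below is a small recursive Python sketch of this scoring (often called the Fitch algorithm). The nested-tuple tree encoding and names are illustrative; the lecture's iterative numbering of nodes 1..2N-1 is replaced here by recursion on the tree structure.

```python
def fitch_cost(tree, column):
    """Parsimony cost of one alignment column.

    tree   : nested tuples, e.g. (("L1", "L2"), ("L3", "L4")) with leaf names as strings
    column : dict mapping leaf name -> character at this column
    Returns (R_root, cost).
    """
    def visit(node):
        if isinstance(node, str):                 # leaf: R_k = { x_k[u] }
            return {column[node]}, 0
        left, right = node
        R_i, c_i = visit(left)
        R_j, c_j = visit(right)
        inter = R_i & R_j
        if inter:                                 # intersection nonempty: no substitution charged
            return inter, c_i + c_j
        return R_i | R_j, c_i + c_j + 1           # union, and C += 1

    return visit(tree)

# Example: a column with characters A, B, A, A has parsimony cost 1, as on the slides
R, C = fitch_cost((("L1", "L2"), ("L3", "L4")),
                  {"L1": "A", "L2": "B", "L3": "A", "L4": "A"})
assert C == 1
```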

40 CS262 Lecture 12, Win06, Batzoglou Example
[Figure: parsimony scoring on a tree for the columns AAAB and BABA, showing leaf sets {A}, {B} and internal sets {A}, {A, B}, {B}.]

41 CS262 Lecture 12, Win06, Batzoglou Traceback to find ancestral nucleotides
1. Choose an arbitrary nucleotide from R_{2N-1} for the root.
2. Having chosen nucleotide r for parent k:
 If r ∈ R_i, choose r for daughter i.
 Else, choose an arbitrary nucleotide from R_i.
Easy to see that this traceback produces some assignment of cost C.

42 CS262 Lecture 12, Win06, Batzoglou Example
[Figure: a tree with leaves A, B, A, B and sets {A, B}, {A}, {B} at its nodes. Several optimal internal labelings are shown; some are admissible with the traceback, and one is still optimal but inadmissible with the traceback.]

43 CS262 Lecture 12, Win06, Batzoglou Probabilistic Methods
A more refined measure of evolution along a tree than parsimony.
P(x_1, x_2, x_root | t_1, t_2) = P(x_root) P(x_1 | t_1, x_root) P(x_2 | t_2, x_root)
If we use Jukes-Cantor, for example, and x_1 = x_root = A, x_2 = C, t_1 = t_2 = 1:
P(A, C, A | t_1, t_2) = p_A · ¼(1 + 3e^(-4α)) · ¼(1 - e^(-4α)) = (¼)^3 (1 + 3e^(-4α))(1 - e^(-4α))

44 CS262 Lecture 12, Win06, Batzoglou Probabilistic Methods
If we know all internal labels x_u:
P(x_1, x_2, …, x_N, x_{N+1}, …, x_{2N-1} | T, t) = P(x_root) Π_{j ≠ root} P(x_j | x_parent(j), t_{j, parent(j)})
Usually we don't know the internal labels, therefore
P(x_1, x_2, …, x_N | T, t) = Σ_{x_{N+1}} Σ_{x_{N+2}} … Σ_{x_{2N-1}} P(x_1, x_2, …, x_{2N-1} | T, t)

45 CS262 Lecture 12, Win06, Batzoglou Felsenstein's Likelihood Algorithm
To calculate P(x_1, x_2, …, x_N | T, t).
Let P(L_k | a) denote the probability of all the leaves below node k, given that the residue at k is a.
Initialization: Set k = 2N - 1.
Iteration: Compute P(L_k | a) for all characters a:
 If k is a leaf node: Set P(L_k | a) = 1(a = x_k)
 If k is not a leaf node:
  1. Compute P(L_i | b), P(L_j | b) for all b, for the daughter nodes i, j.
  2. Set P(L_k | a) = Σ_{b,c} P(b | a, t_i) P(L_i | b) P(c | a, t_j) P(L_j | c)
Termination: Likelihood at this column = P(x_1, x_2, …, x_N | T, t) = Σ_a P(L_{2N-1} | a) P(a)
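
Below is a small recursive Python sketch of this pruning computation for one column. The tree encoding (nested tuples carrying branch lengths), the argument names, and the commented usage with the earlier jukes_cantor_matrix sketch are all illustrative assumptions.

```python
def felsenstein_likelihood(tree, column, subst_p, prior):
    """Likelihood of one alignment column by Felsenstein's pruning algorithm.

    tree     : nested tuples; an internal node is ((left, t_left), (right, t_right)),
               a leaf is just its name (a string)
    column   : dict mapping leaf name -> observed character
    subst_p  : function (a, b, t) -> P(b | a, t), e.g. a Jukes-Cantor entry
    prior    : dict mapping character -> P(a) at the root
    """
    alphabet = list(prior)

    def cond_probs(node):
        """P(L_k | a) for every character a, for the subtree rooted at this node."""
        if isinstance(node, str):                          # leaf: indicator on the observed character
            return {a: 1.0 if a == column[node] else 0.0 for a in alphabet}
        (left, t_i), (right, t_j) = node
        P_i, P_j = cond_probs(left), cond_probs(right)
        # the double sum over b, c factors into a product of two single sums
        return {a: sum(subst_p(a, b, t_i) * P_i[b] for b in alphabet) *
                   sum(subst_p(a, c, t_j) * P_j[c] for c in alphabet)
                for a in alphabet}

    P_root = cond_probs(tree)
    return sum(prior[a] * P_root[a] for a in alphabet)

# Usage sketch with the Jukes-Cantor matrix defined earlier (rate alpha assumed):
# jc = lambda a, b, t: jukes_cantor_matrix(t, alpha=0.3)["ACGT".index(a)]["ACGT".index(b)]
# lik = felsenstein_likelihood((("human", 0.1), ("mouse", 0.3)),
#                              {"human": "A", "mouse": "C"},
#                              jc, {c: 0.25 for c in "ACGT"})
```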

46 CS262 Lecture 12, Win06, Batzoglou Probabilistic Methods
Given M (ungapped) alignment columns of N sequences, define the likelihood of a tree:
L(T, t) = P(Data | T, t) = Π_{m=1…M} P(x_{1m}, …, x_{Nm} | T, t)
Maximum Likelihood Reconstruction:
Given data X = (x_ij), find a topology T and length vector t that maximize the likelihood L(T, t).

47 CS262 Lecture 12, Win06, Batzoglou Current popular methods
HUNDREDS of programs available!
http://evolution.genetics.washington.edu/phylip/software.html#methods
Some recommended programs:
Discrete (parsimony-based):
 Rec-I-DCM3, http://www.cs.utexas.edu/users/tandy/mp.html (Tandy Warnow and colleagues)
Probabilistic:
 SEMPHY, http://www.cs.huji.ac.il/labs/compbio/semphy/ (Nir Friedman and colleagues)

