Published by Karen Varden. Modified about 1 year ago.

1
A simple, practical and complete O(n^3 / log(n))-time Algorithm for RNA folding using the Four-Russians Speedup
Yelena Frid and Dan Gusfield
Department of Computer Science, UC Davis

2
Background
- The problem of computationally predicting the secondary structure of RNA molecules was first introduced more than thirty years ago by:
  Nussinov et al. (1978, 1980)
  Waterman and Smith (1978)
  Zuker and Stiegler (1981)
- They presented dynamic programming solutions with an asymptotic running time of O(n^3), where n is the length of the RNA sequence.

3
There have been several improvements to the running time for this problem:
- A complex worst-case speedup: O(n^3 * (log log n) / (log n)^(1/2)) (Akutsu 1999).
- A practical heuristic speedup: O(nP), where P is in [n, n^2] (Wexler 2007; Backofen 2009). This is still O(n^3) in terms of n, although it is important to note that P is empirically much less than n^2.
Our approach has an O(n^3 / log(n)) running time using the Four-Russians speedup.

4
Four Russians is well known
- The Four-Russians method is a general technique for speeding up dynamic programs.
- The method involves extensive preprocessing of a wide range of possible inputs, before any actual input is seen.
However, it has not been applied to RNA folding, despite an 'expectation' that it could be.

5
Possible reasons for difficulty in applying the Four-Russians speedup
I. The widely exposed version of the original dynamic-programming algorithm does not lend itself to application of the Four-Russians technique. (Solution: choose a different order of evaluation.)
II. It doesn't seem possible to separate the preprocessing and the computation. (Solution: interleave the two.)
We made use of the insights presented in the Graham et al. (1980) paper, which gives a Four-Russians solution to the problem of context-free language recognition, where similar problems were encountered.

6
Basic RNA-folding problem
- Find a maximum number of non-crossing matches of complementary nucleotides in an RNA sequence of length n.
Enhancements:
- Richer scoring schemes
- Minimum energy calculations

7
The input to the basic RNA-folding problem is a string K of length n over the four-letter alphabet {A,U,C,G}.
A matching consists of non-crossing disjoint pairs of sites in K.
[Figure: sites i, i', j and j' of K]
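The slide does not spell out "non-crossing". Under the usual reading, two pairs (i,j) and (i',j') may be nested or side by side, but never interleaved as i < i' < j < j'. The following is a minimal sketch of that check; the function name and the 0-based site indices are assumptions made for illustration.

```python
def is_noncrossing_matching(pairs):
    """Check that pairs (i, j), each with i < j, are disjoint and non-crossing."""
    sites = [site for pair in pairs for site in pair]
    if len(sites) != len(set(sites)):
        return False  # some site is used by two pairs, so the pairs are not disjoint
    for a, (i, j) in enumerate(pairs):
        for k, l in pairs[a + 1:]:
            # Two pairs cross when exactly one end of one lies strictly inside the other.
            if i < k < j < l or k < i < l < j:
                return False
    return True


print(is_noncrossing_matching([(0, 5), (1, 4)]))  # nested pairs
print(is_noncrossing_matching([(0, 3), (2, 5)]))  # interleaved pairs
```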

8
B(i,j)
B(i,j) is the score given to a match between the nucleotides at sites i and j of K. In the simple scoring scheme, a score of 1 is given for complementary pairs (A,U or C,G), and 0 otherwise.
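The simple scoring scheme can be sketched as a small lookup. The helper name `pair_score` and the 0-based indexing are assumptions for illustration, not part of the slides.

```python
# Complementary pairs of the simple scoring scheme, in both orders.
COMPLEMENTARY = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def pair_score(seq, i, j):
    """Return B(i, j): 1 if sites i and j (0-based) hold complementary bases, else 0."""
    return 1 if (seq[i], seq[j]) in COMPLEMENTARY else 0


K = "AUCGGCAUCAA"
print(pair_score(K, 0, 1))  # sites holding A and U
print(pair_score(K, 0, 2))  # sites holding A and C
```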

9
Introduction to the original O(n^3) DP
Let S(i,j) represent the score of the optimal folding of the subsequence consisting of the sites in K from i to j > i, inclusive. The DP recurrence relations are based on the different possibilities for pairing nucleotides.

10
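S(i,j) will equal the maximum of:
S(i+1,j-1) + B(i,j)
S(i,j-1)

11
S(i+1,j)
S(i,k-1) + S(k,j) for some k with i < k < j, which splits the subsequence into a Head (sites i to k-1) and a Tail (sites k to j)

12
As a result, the recurrences for the optimal fold score S(i,j) are:
S(i,j) = max {
    S(i+1,j-1) + B(i,j)                            (Rule a)
    S(i,j-1)                                       (Rule b)
    S(i+1,j)                                       (Rule c)
    max over i < k < j of [ S(i,k-1) + S(k,j) ]    (Rule d: Head + Tail)
}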

13
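S matrix for the sequence AUCGGCAUCAA (rows i, columns j):

        j=1      j=2      j=3      j=4      j=5      j=6      ...   j=11
i=0     S[0,1]   S[0,2]   S[0,3]   S[0,4]   S[0,5]   S[0,6]   ...   S[0,11]
i=1              S[1,2]   S[1,3]   S[1,4]   S[1,5]   S[1,6]   ...   S[1,11]
i=2                       S[2,3]   S[2,4]   S[2,5]   S[2,6]   ...   S[2,11]
i=3                                S[3,4]   S[3,5]   S[3,6]   ...   S[3,11]
i=4                                         S[4,5]   S[4,6]   ...   S[4,11]
i=5                                                  S[5,6]   ...   S[5,11]
i=6                                                           ...   S[6,11]
i=7                                                           ...   S[7,11]
i=8                                                           ...   S[8,11]
i=9                                                           ...   S[9,11]

Usually evaluated in order of increasing subsequence length, i.e. increasing distance between i and j.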

14
[Figure: the same S matrix for AUCGGCAUCAA as on slide 13, with rows i and columns j]
- We choose an alternative permissible order.
- The algorithm will evaluate in order of increasing j and decreasing i.

15
for j = 2 to n
    for i = 1 to j-1
        S(i,j) = max( S(i+1,j-1)+B(i,j),              Rule a
                      S(i,j-1) )                      Rule b      O(n^2) in total
    for i = j-1 to 1
        S(i,j) = max( S(i,j), S(i+1,j) )              Rule c
        for k = j-1 to i+1   {Rule d loop}
            S(i,j) = max( S(i,j), S(i,k-1)+S(k,j) )   Rule d
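The pseudocode above can be turned into a short runnable sketch. This is one possible rendering, not the authors' code: it assumes 0-based sites, the simple 0/1 scoring scheme, and treats S(i,j) as 0 whenever i >= j; the name `fold_score` is hypothetical.

```python
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def fold_score(seq):
    """Cubic-time DP, evaluated by increasing column j and decreasing row i."""
    n = len(seq)
    # S[i][j] stays 0 for i >= j, which covers empty and one-site subsequences.
    S = [[0] * n for _ in range(n)]
    for j in range(1, n):
        for i in range(j):
            b = 1 if (seq[i], seq[j]) in PAIRS else 0
            S[i][j] = max(S[i + 1][j - 1] + b,  # Rule a: i pairs with j
                          S[i][j - 1])          # Rule b: j is unpaired
        for i in range(j - 1, -1, -1):
            S[i][j] = max(S[i][j], S[i + 1][j])  # Rule c: i is unpaired
            for k in range(j - 1, i, -1):        # Rule d loop: Head + Tail
                S[i][j] = max(S[i][j], S[i][k - 1] + S[k][j])
    return S[0][n - 1] if n else 0


print(fold_score("GGGAAACCC"))  # three nested G-C pairs are possible
```

Rows are visited in decreasing order within a column so that every S(k,j) with k > i is final before Rule d reads it.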

16
for j = 2 to n                                          O(n) iterations
    for i = 1 to j-1
        S(i,j) = max( S(i+1,j-1)+B(i,j),     Rule a
                      S(i,j-1) )             Rule b     O(n^2) in total
    for i = j-1 to 1
        S(i,j) = max( S(i,j), S(i+1,j) )     Rule c
        for k = j-1 to i+1   {Rule d loop}              O(n) iterations
            S(i,j) = max( S(i,j), S(i,k-1)+S(k,j) )     O(n^3) in total: the bottleneck

17
Four Russians speedup: first hint
for j = 2 to n
    for i = 1 to j-1
        S(i,j) = max( S(i+1,j-1)+B(i,j),     Rule a
                      S(i,j-1) )             Rule b
    for i = j-1 to 1
        S(i,j) = max( S(i,j), S(i+1,j) )     Rule c
        for k = j-1 to i+1   {Rule d loop}
            S(i,j) = max( S(i,j), S(i,k-1)+S(k,j) )
Idea: instead of decrementing k by one, spend a constant number of operations on each group of q consecutive values of k.

18
What will Rule d loop look like then?

19
Computation components for speeding up Rule d loop
There are several components for speeding up the Rule d loop:
- Rgroup
- Vg vector
- Little vg vector
- k*
- Table R
At first these may seem unclear and extraneous, but they will lead to a speed-up.

20
Rgroup g
Conceptually divide each column in the S matrix into groups of q rows. Each group is called an Rgroup, indexed by g.

21
Rgroup example where q=3
[Figure: the S matrix for AUCGGCAUCAA (rows i, columns j, entries S[i,j]) with its rows divided into Rgroup 0, Rgroup 1 and Rgroup 2]

22
Computation components for speeding up Rule d loop
- Rgroup
- Vg vector
- Little vg vector
- k*
- Table R

23
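Vg vector
For a fixed j, consider an Rgroup g consisting of rows z, z-1, ..., z-q+1 for some z. Vg is the vector of the q values S(z,j), S(z-1,j), ..., S(z-q+1,j) found in those rows of column j.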

24
Computation components for speeding up Rule d loop
- Rgroup
- Vg vector
- Little vg vector
- k*

25
Little vg vector
Observe that for the simple scoring scheme:
B(i,j) = 1 when (i,j) is a permitted match
B(i,j) = 0 when (i,j) is not a permitted match
S(z-1,j) - S(z,j) is either 1 or 0.
[Figure: example vectors V0 and V1 with consecutive differences such as 5-4=1 and 5-5=0]
Hence, the differences between consecutive values in any Vg vector are either 0 or 1.

26
Little vg vector
The changes in consecutive values of Vg can therefore be encoded into little vg, a vector of length q-1.
[Figure: V0 is encoded into v0 relative to the base of V0, e.g. 5-4=1 and 5-5=0]

27
Little vector vg
[Figure: V1 is encoded into v1, e.g. 3-3=0 and 4-3=1]
- Encoding each Vg into little vg takes O(q) time.
- Each column has j/q different Rgroups.
- It takes O(n) time to do the encoding for an entire column.
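The encoding step can be sketched in a few lines. The function name `encode`, the list layout (base first) and the sample values are illustrative assumptions; only the idea of storing a base plus q-1 differences of 0 or 1 comes from the slides.

```python
def encode(V):
    """Split a Vg vector into its base V[0] and the little vector of consecutive differences."""
    little_v = [V[l] - V[l - 1] for l in range(1, len(V))]
    return V[0], little_v


# Illustrative values for q = 3: the vector (3, 3, 4) has differences 3-3=0 and 4-3=1.
base, little_v = encode([3, 3, 4])
print(base, little_v)
```

Since each difference is a single bit, a little vector of length q-1 can take only 2^(q-1) distinct values, which is what makes tabulating all of them feasible.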

28
Decoding the offset: informal definition
Decode: vg -> V', where
- V' is a vector of length q-1
- V' holds offsets from the base: if V'[l] = 4 then Vg[l] = x + 4, where x is the value in Vg[0], the base of the Vg vector.

29
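Decode: formal definition and example
V' is a running sum of the values in little vector vg.
decode: vg -> V', where each entry of V' is the sum of the entries of vg up to and including that position.
Example with q=7, listing each vector from the base upward:
Vg = (4, 5, 5, 6, 7, 7, 8)
encode: vg = (5-4, 5-5, 6-5, 7-6, 7-7, 8-7) = (1, 0, 1, 1, 0, 1)
decode: V' = (0+1, 1+0, 1+1, 2+1, 3+0, 3+1) = (1, 1, 2, 3, 3, 4), the offsets from the base 4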
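Decoding as a running sum can be sketched with the standard library's `itertools.accumulate`. The function names and the little vector used here are illustrative assumptions; `restore` is a hypothetical helper showing how the offsets combine with the base.

```python
from itertools import accumulate


def decode(little_v):
    """Return V': the running sum of the little vector, i.e. each offset from the base."""
    return list(accumulate(little_v))


def restore(base, little_v):
    """Rebuild the full Vg vector from its base and its little vector."""
    return [base] + [base + offset for offset in decode(little_v)]


little_v = [1, 0, 1, 1, 0, 1]  # q = 7, so the little vector has q-1 = 6 entries
print(decode(little_v))
print(restore(4, little_v))
```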

30
Computation components for speeding up Rule d loop
- Rgroup
- Vg vector
- Little vg vector
- k*
- Table R

31
Defining k* for an Rgroup g
In column j, let k*(i,g,j) be the index k in Rgroup g that maximizes the sum S(i,k-1) + S(k,j).

32
Example
Assume column j=11 and row i=3. To directly compute k*(i,g,j) for Rgroup 1 (g=1) we would have to find the max of
S(3,4) + S(5,11)
S(3,5) + S(6,11)
S(3,6) + S(7,11)
and store the k that corresponds to that max bipartition. This would require O(q) operations.
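The direct O(q) scan from this example can be sketched as follows. The function name, the list-of-lists layout for S and the numeric entries are made up for illustration; only the three candidate sums come from the slide.

```python
def k_star_direct(S, i, j, rows):
    """Scan the rows of one Rgroup and return the k that maximizes S[i][k-1] + S[k][j]."""
    best_k = rows[0]
    for k in rows[1:]:
        if S[i][k - 1] + S[k][j] > S[i][best_k - 1] + S[best_k][j]:
            best_k = k
    return best_k


# Made-up table: all zeros except two entries, so the winning split is easy to see.
S = [[0] * 12 for _ in range(12)]
S[3][5] = 2
S[6][11] = 1
# Candidates: S(3,4)+S(5,11), S(3,5)+S(6,11), S(3,6)+S(7,11)
print(k_star_direct(S, 3, 11, [5, 6, 7]))
```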

33
Computation components for speeding up Rule d loop
- Rgroup
- Vg vector
- Little vg vector
- k*
- Table R

34
Introducing Table R (currently a black box) that, given (i,g,vg), returns k*(i,g,j) in O(1) operations.

35
Modified Rule d loop using Table R
for j = 2 to n
    ...
    for i = j-1 to 1
        for g = (j-1)/q to (i+1)/q   {Rule d loop; replaces "for k = j-1 to i+1"}
            get vg given Rgroup g
            retrieve k*(i,g,j) from R(i,g,vg)
            S(i,j) = max( S(i,j), S(i,k*-1)+S(k*,j) )

36
Run-time of Modified Rule d loop using Table R
for j = 2 to n                                          O(n)
    ...
    for i = j-1 to 1                                    O(n)
        for g = (j-1)/q to (i+1)/q                      O(n/q)   {Rule d loop: where Rgroup g is fully complete}
            get vg given Rgroup g
            retrieve k*(i,g,j) from R(i,g,vg)
            S(i,j) = max( S(i,j), S(i,k*-1)+S(k*,j) )
The asymptotic run-time for the Rule d loop is O(n) * O(n) * O(n/q) = O(n^3/q), plus q*n^2 for the values of k that do not fall in a fully complete Rgroup.

37
Modified Rule d loop using Table R
for j = 2 to n
    ...
    for i = j-1 to 1
        for g = (j-1)/q to (i+1)/q   {where Rgroup g is fully complete}
            get vg given Rgroup g
            retrieve k*(i,g,j) from R(i,g,vg)
            S(i,j) = max( S(i,j), S(i,k*-1)+S(k*,j) )
Having k* speeds up the Rule d loop. How can we know k* when we need it?
Answer: Four-Russians preprocessing.

38
38 Preprocessing

39
Preprocessing Table R: Cgroups
We conceptually divide the columns into groups of size q, called Cgroups.
[Figure: example with q=3]

40
Preprocessing Table R
Assume that we run the 'Second Dynamic Programming Algorithm' until j=q. That means that all the S(i,j) values in Cgroup 0 have been computed. At this point we can compute the following:
for each binary vector v of length q-1
    V' = decode(v)
    for each i such that i < q-1
        R(i,0,v) is set to the index k in Rgroup 0 such that S(i,k-1) + V'[k] is maximized.

41
Preprocessing in general
In general, for Cgroup g > 0, we could do a similar preprocessing after all the entries in the columns of Cgroup g have been computed. That is, R(i,g,v) could be found and stored for all i < g*q.
once Cgroup g = (j/q) is complete
    for each binary vector v of length q-1
        V' = decode(v)
        for each i such that i < g*q
            R(i,g,v) is set to the index k in Rgroup g such that S(i,k-1) + V'[k] is maximized.
This costs O(q * n * 2^(q-1)) per Cgroup: 2^(q-1) vectors, O(n) rows i, and O(q) work to find each maximum.
The total pre-processing for all Cgroups is O(q * n * 2^(q-1)) * O(n/q) = O(n^2 * 2^(q-1)) time.

42
42 RNA folding algorithm with Four-Russians Speedup

43
RNA folding algorithm with Four-Russians Speedup
for j = 2 to n
    for i = 1 to j-1
        S(i,j) = max( S(i+1,j-1)+B(i,j), S(i,j-1) )
    for i = j-1 to 1
        for g = (j-1)/q to (i+1)/q   {Rule d loop}
            get vg given Rgroup g
            retrieve k*(i,g,j) from R(i,g,vg)
            S(i,j) = max( S(i,j), S(i,k*-1)+S(k*,j) )          O(n^3/q) in total
    once Cgroup g = (j/q) is complete
        for each binary vector v of length q-1
            V' = decode(v)
            for each i such that i < g*q
                R(i,g,v) is set to the index k in Rgroup g such that S(i,k-1) + V'[k] is maximized.     O(n^2 * 2^(q-1)) in total
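The interleaving of table lookups and preprocessing can be sketched end to end. This is an illustrative rendering under several assumptions, not the authors' implementation: sites are 0-based, groups are taken over the split point m = k-1 (columns g*q to g*q+q-1), splits outside complete groups are scanned directly, and all function names are hypothetical.

```python
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def encode_tail(S, g, j, q):
    """Pack the 0/1 differences of the tail values S[m+1][j], m in group g, into an integer."""
    code = 0
    for l in range(q - 1):
        r = g * q + l + 1  # row of the tail value for split m = g*q + l
        code |= (S[r][j] - S[r + 1][j]) << l
    return code


def build_table(S, g, q):
    """For every little vector and every row i < g*q, store the best split m in group g."""
    table = []
    for code in range(1 << (q - 1)):
        offset = [0] * q  # offset[l] = tail value of split g*q+l minus the group's base
        for l in range(q - 2, -1, -1):
            offset[l] = offset[l + 1] + ((code >> l) & 1)
        table.append([g * q + max(range(q), key=lambda l: S[i][g * q + l] + offset[l])
                      for i in range(g * q)])
    return table


def fold_score_four_russians(seq, q=3):
    """Maximum number of non-crossing complementary pairs, using table lookups for Rule d."""
    n = len(seq)
    if n < 2:
        return 0
    S = [[0] * (n + 1) for _ in range(n + 1)]  # zero padding keeps S[m+1][j] in range
    R = {}  # R[g][code][i] -> best split m for row i within group g
    for j in range(1, n):
        codes = {}         # little-vector code of each group, computed once per column
        g_hi = j // q - 1  # last group whose columns all lie at or before j-1
        for i in range(j - 1, -1, -1):
            b = 1 if (seq[i], seq[j]) in PAIRS else 0
            best = S[i + 1][j - 1] + b  # Rule a
            g_lo = i // q + 1           # first group that starts after row i
            if g_lo > g_hi:
                direct = list(range(i, j))  # no complete group between i and j
            else:
                direct = list(range(i, g_lo * q)) + list(range((g_hi + 1) * q, j))
            for m in direct:  # fewer than 2q splits; m=i and m=j-1 cover Rules c and b
                best = max(best, S[i][m] + S[m + 1][j])
            for g in range(g_lo, g_hi + 1):
                if g not in codes:
                    codes[g] = encode_tail(S, g, j, q)
                m = R[g][codes[g]][i]  # O(1) lookup instead of an O(q) scan
                best = max(best, S[i][m] + S[m + 1][j])
            S[i][j] = best
        if (j + 1) % q == 0:  # the column group j // q has just been completed
            R[j // q] = build_table(S, j // q, q)
    return S[0][n - 1]


print(fold_score_four_russians("GGGAAACCC", q=3))
```

Grouping by the split point m rather than by k keeps each group's head values S(i,m) inside a single column group, so a table can be built as soon as that column group is finished. The sketch stores one table per group and favors clarity over memory use.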

44
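Run-Time
Setting q = log(n), the total run-time of
O(n^2 * 2^(q-1)) [pre-processing] + O(n^3/q) + O(n^2 * q) [computation]
simplifies to O(n^3 / log(n)) time total. For the pre-processing term to stay below n^3 / log(n), the logarithm must be taken to a base greater than 2, so that 2^q grows more slowly than n.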
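The simplification can be written out term by term. The slide does not state the base of the logarithm, so the derivation below assumes q = log_b(n) for some base b > 2; the symbol b is introduced here only for illustration.

```latex
\begin{align*}
q &= \log_b n, \qquad b > 2 \\
n^2 \cdot 2^{\,q-1} &= \tfrac{1}{2}\, n^{2 + \log_b 2}, \qquad \text{with } \log_b 2 < 1 \\
\frac{n^3}{q} &= \frac{n^3}{\log_b n} = O\!\left(\frac{n^3}{\log n}\right) \\
n^2 q &= O\!\left(n^2 \log n\right) \\
O\!\left(n^{2+\log_b 2}\right) + O\!\left(\frac{n^3}{\log n}\right) + O\!\left(n^2 \log n\right)
  &= O\!\left(\frac{n^3}{\log n}\right)
\end{align*}
```

With b = 2 the first term would be n^3 / 2, which would cancel the speedup, so the choice of base matters here.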

45
Empirical Results
- We compare our Four-Russians algorithm to the original O(n^3)-time algorithm.
- The empirical results shown in Table 1 give the average time for 50 tests of randomly generated RNA sequences and 25 sequences downloaded from GenBank, for each size between 1,000bp and 6,000bp.
- The algorithm performs identically for randomly generated and GenBank sequences of equal length.
- This is to be expected because the algorithm's run time is independent of the sequence characters.

46
[Table 1: running times, in seconds]

47
Variations on scoring scheme B
The Four-Russians speedup could be extended to any B(i,j) for which the differences between S(i,j) and S(i+1,j) do not depend on n. Let C denote the size of the set of all possible differences. Then the asymptotic time is: [equation and its two conditions are missing here]

48
Parallel Computing
When running this algorithm in parallel on n processors, we can reduce the total computation time to O(n^2 * log(n)). (This could be reduced to O(n^2) time by a re-ordering method, at the cost of additional space.)

49
Conclusion
- A practical and easy-to-implement Four-Russians speedup algorithm for predicting RNA secondary structure, which could be applied to a wide variety of scoring schemes.
- The asymptotic time of O(n^3 / log(n)) is achieved by interleaving computation and preprocessing.
- This time can be further lowered to O(n^2 * log(n)) when using n processors in parallel.
