Published by Karen Varden. Modified over 3 years ago.

1
A simple, practical and complete O(n^3/log(n))-time Algorithm for RNA folding using the Four-Russians Speedup
Yelena Frid and Dan Gusfield
Department of Computer Science, UC Davis

2
Background
- The problem of computationally predicting the secondary structure of RNA molecules was first introduced more than thirty years ago by:
  Nussinov et al. (1978, 1980)
  Waterman and Smith (1978)
  Zuker and Stiegler (1981)
- They presented dynamic programming solutions with an asymptotic running time of O(n^3), where n is the length of the RNA sequence.

3
There have been several improvements to the running time of this problem:
- A complex worst-case speedup of O(n^3 * (log log n)/(log n)^(1/2)) (Akutsu 1999).
- A practical heuristic speedup of O(nP), where P is in [n, n^2] (Wexler 2007; Backofen 2009). This is still O(n^3) in terms of n. It is important to note that P is empirically much less than n^2.
- Our approach has an O(n^3/log(n)) running time using the Four-Russians speedup.

4
Four Russians is well known
- The Four-Russians method is a general technique for speeding up dynamic programs.
- The method involves extensive preprocessing of a wide range of possible inputs, before any actual input is seen.
- However, it has not been applied to RNA folding, despite an 'expectation' that it could be.

5
Possible reasons for difficulty in applying the Four-Russians speedup
I. The widely exposed version of the original dynamic-programming algorithm does not lend itself to application of the Four-Russians technique. (Solution: choose a different order of evaluation.)
II. It doesn't seem possible to separate the preprocessing and the computation. (Solution: interleave the two.)
We made use of the insights presented in the paper by Graham et al. (1980), which gives a Four-Russians solution to the problem of context-free language recognition, where similar problems were encountered.

6
Basic RNA-folding problem
- Find a maximum number of non-crossing matches of complementary nucleotides in an RNA sequence of length n.
Enhancements:
- Richer scoring schemes
- Minimum energy calculations DP.

7
Input to the basic RNA-folding problem
- The input consists of a string K of length n over the four-letter alphabet {A, U, C, G}.
- A matching consists of non-crossing disjoint pairs of sites in K.
[Figure: pairs of sites in K labeled i, i', j, j']

8
B(i,j)
B(i,j) is the score given to the match between the nucleotides at sites i and j of K. In the simple scoring scheme, a score of 1 is given for complementary pairs (A, U or C, G), and 0 otherwise.
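
The simple scoring scheme can be sketched in a few lines of Python. This is only an illustration: it assumes K is an ordinary Python string indexed from 0, and the names `COMPLEMENTARY` and `B` (as a function taking K explicitly) are chosen here, not taken from the slides.

```python
# Complementary pairs under the simple scoring scheme (A-U and C-G only).
COMPLEMENTARY = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def B(K, i, j):
    """Return 1 if sites i and j of K hold complementary nucleotides, else 0."""
    return 1 if (K[i], K[j]) in COMPLEMENTARY else 0
```

Richer scoring schemes would replace the body of `B` while keeping the same interface.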

9
Introduction to the original O(n^3) DP
Let S(i,j) represent the score of the optimal folding of the subsequence consisting of the sites of K from i to j > i, inclusive. The DP recurrence relations are based on the different possibilities for pairing nucleotides.

10
S(i,j) will equal the maximum of:
- S(i+1,j-1) + B(i,j)
- S(i,j-1)

11
S(i,j) may also equal:
- S(i+1,j)
[Figure labels: Head, Tail]

12
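As a result, the recurrences for the optimal fold score S(i,j) are:
S(i,j) = max of
    S(i+1,j-1) + B(i,j)                          (Rule a)
    S(i,j-1)                                     (Rule b)
    S(i+1,j)                                     (Rule c)
    S(i,k-1) + S(k,j), over all k with i < k < j (Rule d, the Head/Tail split)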

13
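Sequence: A U C G G C A U C A A

i \ j | 1      | 2      | 3      | 4      | 5      | 6      | ... | 11
0     | S[0,1] | S[0,2] | S[0,3] | S[0,4] | S[0,5] | S[0,6] | ... | S[0,11]
1     |        | S[1,2] | S[1,3] | S[1,4] | S[1,5] | S[1,6] | ... | S[1,11]
2     |        |        | S[2,3] | S[2,4] | S[2,5] | S[2,6] | ... | S[2,11]
3     |        |        |        | S[3,4] | S[3,5] | S[3,6] | ... | S[3,11]
4     |        |        |        |        | S[4,5] | S[4,6] | ... | S[4,11]
5     |        |        |        |        |        | S[5,6] | ... | S[5,11]
6     |        |        |        |        |        |        | ... | S[6,11]
7     |        |        |        |        |        |        | ... | S[7,11]
8     |        |        |        |        |        |        | ... | S[8,11]
9     |        |        |        |        |        |        | ... | S[9,11]

The matrix is usually evaluated in order of increasing subsequence length, i.e. increasing distance between i and j.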

14
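Sequence: A U C G G C A U C A A

i \ j | 1      | 2      | 3      | 4      | 5      | 6      | ... | 11
0     | S[0,1] | S[0,2] | S[0,3] | S[0,4] | S[0,5] | S[0,6] | ... | S[0,11]
1     |        | S[1,2] | S[1,3] | S[1,4] | S[1,5] | S[1,6] | ... | S[1,11]
2     |        |        | S[2,3] | S[2,4] | S[2,5] | S[2,6] | ... | S[2,11]
3     |        |        |        | S[3,4] | S[3,5] | S[3,6] | ... | S[3,11]
4     |        |        |        |        | S[4,5] | S[4,6] | ... | S[4,11]
5     |        |        |        |        |        | S[5,6] | ... | S[5,11]
6     |        |        |        |        |        |        | ... | S[6,11]
7     |        |        |        |        |        |        | ... | S[7,11]
8     |        |        |        |        |        |        | ... | S[8,11]
9     |        |        |        |        |        |        | ... | S[9,11]

- We choose an alternative permissible order.
- The algorithm will evaluate in order of increasing j and decreasing i.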

15
for j = 2 to n
    for i = 1 to j-1
        S(i,j) = max( S(i+1,j-1) + B(i,j),                  Rule a
                      S(i,j-1) )                            Rule b
    for i = j-1 to 1
        S(i,j) = max( S(i,j), S(i+1,j) )                    Rule c
        for k = j-1 to i+1                                  {Rule d loop}
            S(i,j) = max( S(i,j), S(i,k-1) + S(k,j) )       Rule d

Running-time annotation on the slide: O(n^2).
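
The pseudocode above can be made runnable as follows. This is a sketch under stated assumptions rather than the authors' code: indices are 0-based instead of 1-based, S(i,j) is taken to be 0 whenever i >= j, the simple 0/1 scoring scheme is used, and the function name `fold_score` is hypothetical.

```python
# Complementary pairs under the simple scoring scheme.
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def fold_score(seq):
    """O(n^3) DP evaluated column by column (increasing j, decreasing i)."""
    n = len(seq)
    if n == 0:
        return 0
    # Full n x n matrix; cells with i >= j stay 0 and act as empty foldings.
    S = [[0] * n for _ in range(n)]
    for j in range(1, n):
        # Rules a and b only read column j-1, so any order of i works here.
        for i in range(j):
            pair = 1 if (seq[i], seq[j]) in PAIRS else 0
            S[i][j] = max(S[i + 1][j - 1] + pair, S[i][j - 1])
        # Rules c and d read rows below i in column j, so i must decrease.
        for i in range(j - 1, -1, -1):
            S[i][j] = max(S[i][j], S[i + 1][j])              # Rule c
            for k in range(j - 1, i, -1):                    # Rule d loop
                S[i][j] = max(S[i][j], S[i][k - 1] + S[k][j])
    return S[0][n - 1]
```

For instance, `fold_score("ACGU")` pairs A with U around a nested C-G pair and returns 2.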

16
for j = 2 to n
    for i = 1 to j-1
        S(i,j) = max( S(i+1,j-1) + B(i,j),                  Rule a
                      S(i,j-1) )                            Rule b
    for i = j-1 to 1
        S(i,j) = max( S(i,j), S(i+1,j) )                    Rule c
        for k = j-1 to i+1                                  {Rule d loop}
            S(i,j) = max( S(i,j), S(i,k-1) + S(k,j) )

Running-time annotations on the slide: O(n), O(n^2), O(n), O(n^3). The Rule d loop is the O(n^3) bottleneck.

17
Four-Russians speedup: first hint

for j = 2 to n
    for i = 1 to j-1
        S(i,j) = max( S(i+1,j-1) + B(i,j),                  Rule a
                      S(i,j-1) )                            Rule b
    for i = j-1 to 1
        S(i,j) = max( S(i,j), S(i+1,j) )                    Rule c
        for k = j-1 to i+1                                  {Rule d loop}
            S(i,j) = max( S(i,j), S(i,k-1) + S(k,j) )

Idea: instead of decrementing k by one at a time, spend a constant number of operations on each group of q consecutive values of k.

18
What will the Rule d loop look like then?

19
Computation components for speeding up the Rule d loop
There are several components for speeding up the Rule d loop:
- Rgroup
- Vg vector
- Little vg vector
- k*
- R Table
At first these may seem unclear and extraneous, but they will lead to a speedup.

20
Rgroup g
Conceptually divide each column of the S matrix into groups of q rows. Each group is called an Rgroup, indexed by g.
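
A small sketch of this grouping, assuming 0-based row indices and groups aligned at multiples of q (the slides do not state where the first group starts, so this alignment is an assumption); both function names are hypothetical.

```python
def rgroup_index(row, q):
    """Index g of the Rgroup that contains the given row."""
    return row // q


def rgroup_rows(g, q):
    """Rows g*q, ..., g*q + q - 1 that make up Rgroup g."""
    return list(range(g * q, (g + 1) * q))
```

With q = 3 this convention places rows 3, 4 and 5 in Rgroup 1.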

21
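Rgroup example where q = 3

Sequence: A U C G G C A U C A A

i \ j | 1      | 2      | 3      | 4      | 5      | 6      | ... | 11
0     | S[0,1] | S[0,2] | S[0,3] | S[0,4] | S[0,5] | S[0,6] | ... | S[0,11]
1     |        | S[1,2] | S[1,3] | S[1,4] | S[1,5] | S[1,6] | ... | S[1,11]
2     |        |        | S[2,3] | S[2,4] | S[2,5] | S[2,6] | ... | S[2,11]
3     |        |        |        | S[3,4] | S[3,5] | S[3,6] | ... | S[3,11]
4     |        |        |        |        | S[4,5] | S[4,6] | ... | S[4,11]
5     |        |        |        |        |        | S[5,6] | ... | S[5,11]
6     |        |        |        |        |        |        | ... | S[6,11]
7     |        |        |        |        |        |        | ... | S[7,11]
8     |        |        |        |        |        |        | ... | S[8,11]
9     |        |        |        |        |        |        | ... | S[9,11]

[Figure labels marking groups of rows: Rgroup 0, Rgroup 1, Rgroup 2]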

23
Vg vector
For a fixed j, consider an Rgroup g consisting of rows z, z-1, ..., z-q+1 for some z. Vg is the vector of the S values of those rows in column j: S(z,j), S(z-1,j), ..., S(z-q+1,j).

25
Little vg vector
Observe that for the simple scoring scheme:
- B(i,j) = 1 when (i,j) is a permitted match
- B(i,j) = 0 when (i,j) is not a permitted match
- S(z-1,j) - S(z,j) is either 1 or 0.
Hence, the difference between consecutive values in any Vg vector is either 0 or 1.
[Figure: vectors V0 and V1, with example differences 5 - 4 = 1 and 5 - 5 = 0]

26
Little vg vector
The changes in consecutive values of Vg can therefore be encoded into little vg, a vector of length q-1.
[Figure: V0 is encoded into v0 (5 - 4 = 1, 5 - 5 = 0); the base of V0 is marked]

27
Little vector vg
[Figure: V1 is encoded into v1 (3 - 3 = 0, 4 - 3 = 1)]
- Encoding each Vg into little vg takes O(q) time.
- Each column has j/q different Rgroups.
- Encoding the entire column takes O(n) time.
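
The encoding step can be sketched as below. It assumes Vg is given as a Python list that starts at the base, so that Vg[0] = S(z,j) and Vg[l] = S(z-l,j); the function name `encode` follows the slides, while the returned (base, vg) pair is a choice made for this illustration.

```python
def encode(Vg):
    """Split Vg into its base value and the q-1 consecutive differences."""
    base = Vg[0]
    vg = [Vg[l] - Vg[l - 1] for l in range(1, len(Vg))]
    # Under the simple scoring scheme every difference must be 0 or 1.
    if any(d not in (0, 1) for d in vg):
        raise ValueError("consecutive values differ by something other than 0 or 1")
    return base, vg
```

For the two small examples on the slides, a group with values 4, 5, 5 encodes to base 4 and differences [1, 0], and a group with values 3, 3, 4 encodes to base 3 and differences [0, 1].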

28
Decoding the offset: informal definition
decode: vg -> V', where
- V' is a vector of length q-1
- if V'[l] = 4, then Vg[l] = x + 4, where x is the value of Vg[0], the base of the Vg vector.

29
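decode: formal definition and example
V' is the running sum of the values in the little vector vg.
decode: vg -> V', where V'[l] = vg[1] + vg[2] + ... + vg[l]

Example with q = 7, listed from the base upward:
Vg = (4, 5, 5, 6, 7, 7, 8), base = 4
encode: vg = (5-4, 5-5, 6-5, 7-6, 7-7, 8-7) = (1, 0, 1, 1, 0, 1)
decode: V' = (1, 1+0, 1+1, 2+1, 3+0, 3+1) = (1, 1, 2, 3, 3, 4), the offsets from the base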
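
Decoding is a running sum, which the Python standard library provides as `itertools.accumulate`. The sketch below assumes the little vector is a list of 0/1 values ordered from the base upward; `restore` is a hypothetical helper added here to show how the base turns offsets back into S values.

```python
from itertools import accumulate


def decode(vg):
    """Return V', the running sum of vg: the offset of each row from the base."""
    return list(accumulate(vg))


def restore(base, vg):
    """Rebuild the full Vg vector from its base and its little vector."""
    return [base] + [base + offset for offset in decode(vg)]
```

Applied to the q = 7 example, `decode([1, 0, 1, 1, 0, 1])` gives `[1, 1, 2, 3, 3, 4]`, and adding the base 4 recovers the original values.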

31
Defining k* for an Rgroup g
In column j, let k*(i,g,j) be the index k in Rgroup g that maximizes the sum S(i,k-1) + S(k,j).

32
Example
For example, assuming column j = 11 and row i = 3: to directly compute k*(i,g,j) for Rgroup 1 (g = 1), we would have to find the maximum of
- S(3,4) + S(5,11)
- S(3,5) + S(6,11)
- S(3,6) + S(7,11)
and store the k that corresponds to that maximum bipartition. This would require O(q) operations.
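
The direct O(q) computation of k* can be sketched as follows. The S values in the usage note are made up purely for illustration (the slides do not give them), S is stored as a dictionary keyed by (row, column), and the function name `k_star_direct` is hypothetical.

```python
def k_star_direct(S, i, group_rows, j):
    """Scan the rows k of one Rgroup and return the k maximizing S(i,k-1) + S(k,j)."""
    return max(group_rows, key=lambda k: S[(i, k - 1)] + S[(k, j)])


# Hypothetical S values for i = 3, j = 11 and the rows k = 5, 6, 7 of the example.
S_demo = {
    (3, 4): 0, (3, 5): 1, (3, 6): 2,
    (5, 11): 3, (6, 11): 2, (7, 11): 2,
}
best_k = k_star_direct(S_demo, 3, [5, 6, 7], 11)
```

With these made-up values the three sums are 3, 3 and 4, so `best_k` is 7. Table R exists to return the same answer without the scan.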

34
Introducing Table R (currently a black box): given (i, g, vg), it returns k*(i,g,j) in O(1) operations.

35
Modified Rule d loop using Table R

for j = 2 to n
    ....
    for i = j-1 to 1
        for g = (j-1)/q to (i+1)/q        {Rule d loop; replaces "for k = j-1 to i+1"}
            get vg given Rgroup g
            retrieve k*(i,g,j) from R(i,g,vg)
            S(i,j) = max( S(i,j), S(i,k*-1) + S(k*,j) )

36
Run-time of the modified Rule d loop using Table R

for j = 2 to n
    ....
    for i = j-1 to 1
        for g = (j-1)/q to (i+1)/q        { Rule d loop: where Rgroup g is fully complete }
            get vg given Rgroup g
            retrieve k*(i,g,j) from R(i,g,vg)
            S(i,j) = max( S(i,j), S(i,k*-1) + S(k*,j) )

The asymptotic run-time for the Rule d loop is O(n^2/q) for each column j, i.e. O(n^3/q) over all n columns.
Annotations on the slide: O(n), O(n/q), q*n^2.

37
Modified Rule d loop using Table R

for j = 2 to n
    ....
    for i = j-1 to 1
        for g = (j-1)/q to (i+1)/q        { where Rgroup g is fully complete }
            get vg given Rgroup g
            retrieve k*(i,g,j) from R(i,g,vg)
            S(i,j) = max( S(i,j), S(i,k*-1) + S(k*,j) )

Having k* speeds up the Rule d loop. How can we know k* when we need it?
Answer: Four-Russians preprocessing.

38
Preprocessing

39
Preprocessing Table R: Cgroups
We conceptually divide the columns into groups of size q, called Cgroups. Example with q = 3.

40
Preprocessing Table R
Assume that we run the 'Second Dynamic Programming Algorithm' until j = q. That means that all the S(i,j) values in Cgroup 0 have been computed. At this point we can compute the following:

for each binary vector v of length q-1
    V' = decode(v)
    for each i such that i < q-1
        R(i, 0, v) is set to the index k in Rgroup 0 such that S(i,k-1) + V'[k] is maximized.

41
Preprocessing in general
In general, for Cgroup g > 0, we could do a similar preprocessing after all the entries in the columns of Cgroup g have been computed. That is, R(i,g,v) could be found and stored for all i < g*q.

once Cgroup g = (j/q) is complete
    for each binary vector v of length q-1
        V' = decode(v)
        for each i such that i < g*q
            R(i, g, v) is set to the index k in Rgroup g such that S(i,k-1) + V'[k] is maximized.

Preprocessing one Cgroup takes O(q*n*2^(q-1)) time. The total for preprocessing all Cgroups is O(q*n*2^(q-1)) * O(n/q) = O(n^2 * 2^(q-1)) time.
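
The preprocessing for one group can be sketched and checked against a direct scan. Several things here are assumptions of this illustration and not taken from the slides: indices are 0-based, Rgroup g consists of rows g*q through g*q+q-1, the table is a dictionary keyed by (i, v), and the offset list is extended with a leading 0 for the base row. The helper `nussinov` is a plain O(n^3) fill used only to obtain real S values to test on.

```python
from itertools import accumulate, product

PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def nussinov(seq):
    """Plain O(n^3) fill of the S matrix (cells with i >= j stay 0)."""
    n = len(seq)
    S = [[0] * n for _ in range(n)]
    for j in range(1, n):
        for i in range(j - 1, -1, -1):
            best = S[i + 1][j - 1] + (1 if (seq[i], seq[j]) in PAIRS else 0)
            for k in range(i + 1, j + 1):      # k = j and k = i+1 cover Rules b and c
                best = max(best, S[i][k - 1] + S[k][j])
            S[i][j] = best
    return S


def build_R_for_group(S, g, q):
    """R[(i, v)] = k* for Rgroup g; needs columns g*q-1 .. g*q+q-2 of S to be final."""
    z = g * q + q - 1                          # bottom row of the group, the base of Vg
    R = {}
    for v in product((0, 1), repeat=q - 1):
        offsets = [0] + list(accumulate(v))    # offsets[l] is V' for row z-l (0 = base)
        for i in range(g * q):
            best_l = max(range(q), key=lambda l: S[i][z - l - 1] + offsets[l])
            R[(i, v)] = z - best_l
    return R
```

The lookup is valid in any later column j because S(k,j) equals the base plus the decoded offset for every row k of the group, and adding the same base to every candidate does not change which k wins.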

42
RNA folding algorithm with Four-Russians Speedup

43
RNA folding algorithm with Four-Russians Speedup

for j = 2 to n
    for i = 1 to j-1
        S(i,j) = max( S(i+1,j-1) + B(i,j), S(i,j-1) )
    for i = j-1 to 1
        for g = (j-1)/q to (i+1)/q        {Rule d loop}
            get vg given Rgroup g
            retrieve k*(i,g,j) from R(i,g,vg)
            S(i,j) = max( S(i,j), S(i,k*-1) + S(k*,j) )
    once Cgroup g = (j/q) is complete
        for each binary vector v of length q-1
            V' = decode(v)
            for each i such that i < g*q
                R(i, g, v) is set to the index k in Rgroup g such that S(i,k-1) + V'[k] is maximized.

Running times: O(n^3/q) for the Rule d loop and O(n^2 * 2^(q-1)) for the preprocessing.
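
The interleaved algorithm can be sketched end to end as below. This is an illustrative reading of the pseudocode, not the authors' implementation, and it makes several assumptions: 0-based indices; Rgroup g is rows g*q through g*q+q-1; the little vector is packed into the bits of an integer; split points that do not fall in a complete Rgroup are scanned directly; and Rules b and c are treated as the split points k = j and k = i+1. All names are hypothetical. `fold_plain` is included only so the two results can be compared.

```python
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def fold_plain(seq):
    """Reference O(n^3) DP, used to cross-check the table-driven version."""
    n = len(seq)
    S = [[0] * n for _ in range(n)]
    for j in range(1, n):
        for i in range(j - 1, -1, -1):
            best = S[i + 1][j - 1] + (1 if (seq[i], seq[j]) in PAIRS else 0)
            for k in range(i + 1, j + 1):
                best = max(best, S[i][k - 1] + S[k][j])
            S[i][j] = best
    return S[0][n - 1] if n else 0


def fold_four_russians(seq, q=3):
    """Same score as fold_plain; complete Rgroups are resolved by one table lookup."""
    assert q >= 2
    n = len(seq)
    S = [[0] * n for _ in range(n)]
    nvec = 1 << (q - 1)
    # offsets[v][l]: decoded offset of row z-l from the base, for bit-packed v.
    offsets = []
    for v in range(nvec):
        row = [0]
        for l in range(1, q):
            row.append(row[-1] + ((v >> (l - 1)) & 1))
        offsets.append(row)
    R = {}                                        # R[g][i][v] = k*
    for j in range(1, n):
        # Preprocessing: row j closes Rgroup g, and columns up to j-1 are final.
        if (j + 1) % q == 0:
            g = (j + 1) // q - 1
            R[g] = [[j - max(range(q), key=lambda l: S[i][j - l - 1] + offsets[v][l])
                     for v in range(nvec)]
                    for i in range(g * q)]
        vcode = {}                                # bit-packed vg of each complete Rgroup
        for i in range(j - 1, -1, -1):
            best = S[i + 1][j - 1] + (1 if (seq[i], seq[j]) in PAIRS else 0)  # Rule a
            first = (i + q) // q                  # first Rgroup entirely below row i
            last = (j + 1) // q - 1               # last Rgroup entirely within rows <= j
            if first <= last:
                leftover = list(range(i + 1, first * q)) + list(range((last + 1) * q, j + 1))
                for g in range(first, last + 1):  # O(1) work per complete Rgroup
                    k = R[g][i][vcode[g]]
                    best = max(best, S[i][k - 1] + S[k][j])
            else:
                leftover = range(i + 1, j + 1)
            for k in leftover:                    # fewer than 2q split points
                best = max(best, S[i][k - 1] + S[k][j])
            S[i][j] = best
            if i % q == 0 and i + q - 1 <= j:     # Rgroup i//q of column j is now complete
                z = i + q - 1
                code = 0
                for l in range(1, q):
                    code |= (S[z - l][j] - S[z - l + 1][j]) << (l - 1)
                vcode[i // q] = code
    return S[0][n - 1] if n else 0
```

Storing every `R[g]` keeps the sketch short at the cost of memory; it is meant for checking the idea on short sequences, for example by comparing `fold_four_russians(seq, q)` with `fold_plain(seq)` for several values of q.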

44
Run-Time
Setting q = log(n), the total run-time of
    O(n^2 * 2^(q-1))  [pre-processing]  +  O(n^3/q) + O(n^2 * q)  [computation]
simplifies to O(n^3/log(n)) time in total.
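
The simplification can be spelled out term by term. The slide does not state the base of the logarithm; the derivation below assumes q = log_b(n) for some constant base b > 2. This assumption matters, because with base 2 the preprocessing term would be n^2 * 2^(q-1) = n^3/2, which is not within the stated bound.

```latex
\begin{align*}
q &= \log_b n, \qquad b > 2 \quad \text{(assumed base)} \\
n^2 \cdot 2^{\,q-1} &= \tfrac{1}{2}\, n^{2} \cdot n^{\log_b 2}
   = \tfrac{1}{2}\, n^{\,2+\log_b 2}
   = o\!\left(\frac{n^3}{\log n}\right)
   \quad \text{since } \log_b 2 < 1 \\
\frac{n^3}{q} &= \frac{n^3}{\log_b n} = O\!\left(\frac{n^3}{\log n}\right) \\
n^2 q &= n^2 \log_b n = o\!\left(\frac{n^3}{\log n}\right) \\
\text{total} &= O\!\left(\frac{n^3}{\log n}\right)
\end{align*}
```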

45
Empirical Results
- We compare our Four-Russians algorithm to the original O(n^3)-time algorithm.
- The empirical results shown in Table 1 give the average time for 50 tests of randomly generated RNA sequences and 25 sequences downloaded from GenBank, for each size between 1,000 bp and 6,000 bp.
- The algorithm performs identically for randomly generated and GenBank sequences of equal length.
- This is to be expected, because the algorithm's run time is independent of the sequence characters.

46
[Table 1: average running times, in seconds]

47
Variations on scoring scheme B
The Four-Russians speedup could be extended to any B(i,j) for which no difference between S(i,j) and S(i+1,j) depends on n. Let C denote the size of the set of all possible differences. Then the asymptotic time is: [formula and its conditions not recoverable from the slide]

48
Parallel Computing
When running this algorithm in parallel on n processors, we can reduce the total computation time to O(log(n) * n^2). (This could be reduced to O(n^2) time by a re-ordering method, at a trade-off in space.)

49
Conclusion
- A practical and easy-to-implement Four-Russians speedup algorithm for the prediction of RNA secondary structure, which could be applied to a wide variety of scoring schemes.
- The asymptotic time of O(n^3/log(n)) is achieved by interleaving computation and preprocessing.
- This time can be further lowered to O(log(n) * n^2) when using n processors in parallel.
