1 A simple, practical and complete O( n 3 /log(n))-time Algorithm for RNA folding using the Four-Russians Speedup Yelena Frid and Dan Gusfield Department.

Slides:



Advertisements
Similar presentations
Improved Algorithms for Inferring the Minimum Mosaic of a Set of Recombinants Yufeng Wu and Dan Gusfield UC Davis CPM 2007.
Advertisements

Longest Common Subsequence
Fast Algorithms For Hierarchical Range Histogram Constructions
Lecture 3: Parallel Algorithm Design
Introduction to Bioinformatics Algorithms Divide & Conquer Algorithms.
Introduction to Bioinformatics Algorithms Divide & Conquer Algorithms.
Introduction to Bioinformatics Algorithms Divide & Conquer Algorithms.
6 - 1 Chapter 6 The Secondary Structure Prediction of RNA.
Refining Edits and Alignments Υλικό βασισμένο στο κεφάλαιο 12 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University.
 CpG is a pair of nucleotides C and G, appearing successively, in this order, along one DNA strand.  CpG islands are particular short subsequences in.
Complexity 7-1 Complexity Andrei Bulatov Complexity of Problems.
1 Reduction between Transitive Closure & Boolean Matrix Multiplication Presented by Rotem Mairon.
1 Lecture 8: Genetic Algorithms Contents : Miming nature The steps of the algorithm –Coosing parents –Reproduction –Mutation Deeper in GA –Stochastic Universal.
Complexity Analysis (Part I)
Dynamic Programming1. 2 Outline and Reading Matrix Chain-Product (§5.3.1) The General Technique (§5.3.2) 0-1 Knapsack Problem (§5.3.3)
Introduction to Bioinformatics Algorithms Block Alignment and the Four-Russians Speedup Presenter: Yung-Hsing Peng Date:
CHAPTER 4 Decidability Contents Decidable Languages
CSC401 – Analysis of Algorithms Lecture Notes 12 Dynamic Programming
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez June 23, 2004.
Dynamic Programming1. 2 Outline and Reading Matrix Chain-Product (§5.3.1) The General Technique (§5.3.2) 0-1 Knapsack Problem (§5.3.3)
COMP305. Part II. Genetic Algorithms. Genetic Algorithms.
7 -1 Chapter 7 Dynamic Programming Fibonacci Sequence Fibonacci sequence: 0, 1, 1, 2, 3, 5, 8, 13, 21, … F i = i if i  1 F i = F i-1 + F i-2 if.
UNC Chapel Hill Lin/Manocha/Foskey Optimization Problems In which a set of choices must be made in order to arrive at an optimal (min/max) solution, subject.
Multiple Sequence alignment Chitta Baral Arizona State University.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 20, 2003.
Finding Common RNA Pseudoknot Structures in Polynomial Time Patricia Evans University of New Brunswick.
Dynamic Programming. Pairwise Alignment Needleman - Wunsch Global Alignment Smith - Waterman Local Alignment.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 10, 2005.
Class 2: Basic Sequence Alignment
Dynamic Programming Introduction to Algorithms Dynamic Programming CSE 680 Prof. Roger Crawfis.
Case Study. DNA Deoxyribonucleic acid (DNA) is a nucleic acid that contains the genetic instructions used in the development and functioning of all known.
Developing Pairwise Sequence Alignment Algorithms
Presented by Johanna Lind and Anna Schurba Facility Location Planning using the Analytic Hierarchy Process Specialisation Seminar „Facility Location Planning“
Strand Design for Biomolecular Computation
Slides are based on Negnevitsky, Pearson Education, Lecture 12 Hybrid intelligent systems: Evolutionary neural networks and fuzzy evolutionary systems.
RNA Secondary Structure Prediction Spring Objectives  Can we predict the structure of an RNA?  Can we predict the structure of a protein?
Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.
CSC401: Analysis of Algorithms CSC401 – Analysis of Algorithms Chapter Dynamic Programming Objectives: Present the Dynamic Programming paradigm.
Multiple Sequence Alignments Craig A. Struble, Ph.D. Department of Mathematics, Statistics, and Computer Science Marquette University.
CS 8833 Algorithms Algorithms Dynamic Programming.
Prof. Amr Goneid, AUC1 Analysis & Design of Algorithms (CSCE 321) Prof. Amr Goneid Department of Computer Science, AUC Part 8. Greedy Algorithms.
Dynamic Programming Louis Siu What is Dynamic Programming (DP)? Not a single algorithm A technique for speeding up algorithms (making use of.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
O PTIMAL SERVICE TASK PARTITION AND DISTRIBUTION IN GRID SYSTEM WITH STAR TOPOLOGY G REGORY L EVITIN, Y UAN -S HUN D AI Adviser: Frank, Yeong-Sung Lin.
Pairwise Local Alignment and Database Search Csc 487/687 Computing for Bioinformatics.
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
1 CSCD 326 Data Structures I Hashing. 2 Hashing Background Goal: provide a constant time complexity method of searching for stored data The best traditional.
Optimization Problems In which a set of choices must be made in order to arrive at an optimal (min/max) solution, subject to some constraints. (There may.
Dynamic Programming1. 2 Outline and Reading Matrix Chain-Product (§5.3.1) The General Technique (§5.3.2) 0-1 Knapsack Problem (§5.3.3)
Computer Sciences Department1.  Property 1: each node can have up to two successor nodes (children)  The predecessor node of a node is called its.
Chapter 7 Dynamic Programming 7.1 Introduction 7.2 The Longest Common Subsequence Problem 7.3 Matrix Chain Multiplication 7.4 The dynamic Programming Paradigm.
CSCI 256 Data Structures and Algorithm Analysis Lecture 16 Some slides by Kevin Wayne copyright 2005, Pearson Addison Wesley all rights reserved, and some.
Protein Structure Prediction: Threading and Rosetta BMI/CS 576 Colin Dewey Fall 2008.
Genetic Algorithm. Outline Motivation Genetic algorithms An illustrative example Hypothesis space search.
The Acceptance Problem for TMs
Divide & Conquer Algorithms
RNA sequence-structure alignment
Summary of lectures Introduction to Algorithm Analysis and Design (Chapter 1-3). Lecture Slides Recurrence and Master Theorem (Chapter 4). Lecture Slides.
Hidden Markov Models Part 2: Algorithms
Estimating Recombination Rates
Data Structures Review Session
Intro to Alignment Algorithms: Global and Local
Predicting the Secondary Structure of RNA
Comparative RNA Structural Analysis
Dynamic Programming 1/15/2019 8:22 PM Dynamic Programming.
Dynamic Programming Dynamic Programming 1/18/ :45 AM
Dynamic Programming II DP over Intervals
Computational Genomics Lecture #3a
CSE 5290: Algorithms for Bioinformatics Fall 2009
Presentation transcript:

1 A simple, practical and complete O( n 3 /log(n))-time Algorithm for RNA folding using the Four-Russians Speedup Yelena Frid and Dan Gusfield Department of Computer Science UC Davis

2 Background -Problem of computationally predicting the secondary structure of RNA molecules was first introduced more than thirty years ago by: Nussinov et. Al. (1978, 1980) Waterman and Smith (1978) Zuker M. and Stiegler (1981) They presented Dynamic programming solutions with an asymptotic runtime of O(n 3 ). Where n is the length of the RNA sequence.

3 There have been several improvements to the running time of this problem: A complex worse case speed up O( n 3 * (loglogn)/(logn) 1/2 )– (Akutsu 1999) Practical heuristic speed up O(nP) where P in [n, n 2 ] (Wexler 2007, Backofen, 2009). This is still O( n 3 ) in terms of n. It is important to note that P is empirically much less then n 2. Our approach has an O( n 3 /log(n)) running time using the FOUR RUSSIANS speedup.

4 Four Russians is well known -The Four Russians method is a general technique for speeding up Dynamic programs. - The method involves extensive preprocessing of a wide-range of possible inputs, before any actual input is seen. However, it has not been applied to RNA folding, despite an ‘expectation’ that it could be.

5 Possible reasons for difficulty in applying the FOUR RUSSIANS speedup I.Widely exposed version of the original dynamic- programming algorithm does not lend itself to application of the Four-Russians technique. (Solution: choose a different order of evaluation.) II.It doesn’t seem possible to separate the preprocessing and the computation. (Solution: interleave the two.) We made use of the insights presented in Graham et. al. (1980) paper which gives a Four-Russians solution to the problem of Context-Free Language recognition, where similar problems were encountered.

6 Basic RNA-folding problem - Find a maximum number of non-crossing matches of complimentary nucleotides in an RNA sequence of length n. Enhancements: -Richer scoring schemes -Minimum energy calculations DP.

7 Input to the basic RNA-folding problem consists of a string K of length n over a four-letter alphabet {A,U,C,G} A matching consists of non-crossing disjoint pairs of sites in K. i ii’jj’

B(i,j) B(i,j) is the score given to the match between nucleotides in sites i and j of K. In a simple scoring scheme a score of 1 is given for complementary pairs ( A, U or C, G), otherwise 0.

9 Introduction to the original O( n 3 ) DP Let S(i,j) represent the score for the optimal folding of the subsequence consisting of the sites in K from i to j>i inclusively. The DP recurrence relations are based on the different possibilities of pairing nucleotides.

10 S(i,j) will equal the maximum of : S(i+1,j-1)+B(i,j) S(i, j-1)

11 Head Tail S(i+1, j)

12 As a result the recurrences for the optimal fold score S(i,j) are: HeadTail

13 AUCGGCAUCAA S[0,1]S[0,2]S[0,3]S[0,4]S[0,5]S[0,6]S[0,11] 1 S[1,2]S[1,3]S[0,4]S[0,5]S[1,6]S[1,11] 2 S[2,3]S[2,4]S[2,5]S[2,6]S[2,11] 3 S[3,4]S[3,5]S[3,6]S[3,11] 4 S[4,5]S[4,6]S[4,11] 5 S[5,6]S[5,11] 6 S[6,11] 7 S[7,11] 8 S[8,11] 9 S[9,11] j …… i Usually evaluated in increasing length of subsequence i.e. distance between i and j.

14 AUCGGCAUCAA S[0,1]S[0,2]S[0,3]S[0,4]S[0,5]S[0,6]S[0,11] 1 S[1,2]S[1,3]S[0,4]S[0,5]S[1,6]S[1,11] 2 S[2,3]S[2,4]S[2,5]S[2,6]S[2,11] 3 S[3,4]S[3,5]S[3,6]S[3,11] 4 S[4,5]S[4,6]S[4,11] 5 S[5,6]S[5,11] 6 S[6,11] 7 S[7,11] 8 S[8,11] 9 S[9,11] j …… i -We choose an alternative permissible order. -The algorithm will evaluate in order of increasing j and decreasing i.

15 for j= 2 to n for i= 1 to j-1 S(i+1,j-1)+B(i,j) Rule a S(i,j-1) Rule b for i = j-1 to 1 S(i,j) = max( S(i,j), S(i+1,j) ) Rule c for k=j-1 to i+1 {Rule d loop} S(i,j)=max( S(i,j), S(i,k-1)+S(k,j) ) Rule d S(i,j)=max O(n 2 )

16 for j= 2 to n for i= 1 to j-1 S(i+1,j-1)+B(i,j) Rule a S(i,j-1) Rule b for i = j-1 to 1 S(i,j) = max( S(i,j), S(i+1,j) ) Rule c for k=j-1 to i+1 {Rule d loop} S(i,j)=max( S(i,j), S(i,k-1)+S(k,j) ) S(i,j)=max O(n) O(n 2 ) O(n) O(n 3 ) Bottleneck

17 for j= 2 to n for i= 1 to j-1 S(i+1,j-1)+B(i,j) Rule a S(i,j-1) Rule b for i = j-1 to 1 S(i,j) = max( S(i,j), S(i+1,j) ) Rule c for k=j-1 to i+1 {Rule d loop} S(i,j)=max( S(i,j), S(i,k-1)+S(k,j) ) S(i,j)=max Idea: Change the decrement by one to a constant number of operations in each group of size q. Four Russians speedup: first hint

What will Rule d loop look like then?

19 Computation components for speeding up Rule d loop There are several components for speeding up Rule d loop. Rgroup Vg vector Little v g vector k* R Table At first these may seem unclear and extraneous but will lead to a speed-up.

20 Rgroup g Conceptually divide each column in the S matrix into groups of q rows. Each is called an Rgroup, indexed by g.

21 AUCGGCAUCAA S[0,1]S[0,2]S[0,3]S[0,4]S[0,5]S[0,6]S[0,11] 1 S[1,2]S[1,3]S[1,4]S[1,5]S[1,6]S[1,11] 2 S[2,3]S[2,4]S[2,5]S[2,6]S[2,11] 3 S[3,4]S[3,5]S[3,6]S[3,11] 4 S[4,5]S[4,6]S[4,11] 5 S[5,6]S[5,11] 6 S[6,11] 7 S[7,11] 8 S[8,11] 9 S[9,11] j …… i Rgroup example where q=3 Rgroup 0 Rgroup 1 Rgroup 2

22 Computation components for speeding up Rule d loop Rgroup Vg vector Little v g vector k* R Table

23 Vg vector For a fixed j, consider an Rgroup g consisting of rows z, z-1, … z-q+1 for some z<j. Let Vg = {S(z,j), S(z-1, j),… S(z-q+1,j)} for all Rgroups g. V0 V1

24 Computation components for speeding up Rule d loop Rgroup Vg vector Little v g vector k*

25 Little v g vector Observe that for the simple scoring scheme: B(i,j)=1 when (i,j) is a permitted match B(i,j)=0 when (i,j) is not a permitted match S(z-1,j) – S(z,j) = {1 or 0}. V0 V1 5-4=1 5-5=0 Hence, the difference between consecutive values in any Vg vector are either 0 or 1.

26 Little v g vector The changes in consecutive values of Vg could therefore be encoded into little v g,a vector of length q-1. V0 V1 encode* 5-4=1 5-5=0 v0v0 Base of V0

27 Little vector v g V1 V0 encode 3-3=0 4-3=1 v1v1 - Encoding each Vg into little v g takes O(q) time. - Each column has j/q different Rgroups. - O(n) time to do encoding for the entire column.

Decoding the offset- non formal definition Decode: v g  V`; where - V’ is a vector of length q-1 - if V`[l]=4 then Vg[l]=x+4 where x is the value in Vg(0) or the base of the Vg vector.

29 decode formal definition and example It will be a running sum of the values in little vector v g. decode: v g ->V’ ; where Example q=7 Vg encode 7-7=0 8-7=1 6-5=1 7-6=1 5-4=1 5-5=0 vgvg 3+0=3 3+1=4 1+1=2 2+1= =1 V’ decode Offset from the base base

30 Computation components for speeding up Rule d loop Rgroup Vg vector Little v g vector k* Table R

31 Defining k* for an Rgroup g In column j, let k*(i,g,j) be the index k such that the sum S(i,k-1)+S(k,j) is maximized over the indices in Rgroup g.

32 Example S(3,4)+S(5,11) S(3,5)+S(6,11) S(3,6)+S(7,11) For example, assuming column j=11 and row i=3: To directly compute k*(i,g,j) for Rgroup1 (g=1) we would have to find max of and store the k that corresponds to that max bipartition. This would require O(q) time operations.

33 Computation components for speeding up Rule d loop Rgroup Vg vector Little v g vector k* Table R

Introducing Table R (currently a black box) that given (i,g,v g ) returns k*(i,g,j) in O(1) operations.

35 for j= 2 to n …. for i = j-1 to 1 for k=j-1 to i+1 get v g given Rgroup g retrieve k*(i,g,j) from R(i,g,v g ) S(i,j)=max( S(i,j), S(i,k*-1)+S(k*,j) ) Modified Rule d loop using Table R for g=(j-1)/q to (i+1)/q {Rule d loop}

36 for j= 2 to n …. for i = j-1 to 1 for g=(j-1)/q to (i+1)/q { Rule d loop: where Rgroup g is fully complete } get v g given Rgroup g retrieve k*(i,g,j) from R(i,g,v g ) S(i,j)=max( S(i,j), S(i,k*-1)+S(k*,j) ) Run-time of Modified Rule d loop using Table R The asymptotic run- time for Rule d loop is O( n 2 /q ) O(n) O(n/q) q*n 2

37 for j= 2 to n …. for i = j-1 to 1 for g=(j-1)/q to (i+1)/q { where Rgroup g is fully complete } get v g given Rgroup g retrieve k*(i,g,j) from R(i,g,v g ) S(i,j)=max( S(i,j), S(i,k*-1)+S(k*,j) ) Modified Rule d loop using Table R Having k* speeds up the Rule d loop. How can we know k* when we need it? Answer: Four-Russians preprocessing.

38 Preprocessing

39 Preprocessing Table R: Cgroups We conceptually divide the columns into groups of size q. Example with q=3.

40 Preprocessing Table R Assume that we run the ‘ Second Dynamic Programming Algorithm ’ until j=q.  That means that all the S(i,j) values in Cgroup 0 have been computed. At this point we can compute the following: for each binary vector v of length q-1 V’=decode(v) for each i such that i < q-1 R(i, 0, v) is set to the index k in Rgroup 0 such that S(i,k-1) + V’[k] is maximized.

41 Preprocessing in general In general, for Cgroup g > 0, we could do a similar preprocessing after all the entries in columns of Cgroup g have been computed. That is, R(i,g,v) could be found and stored for all i < g*q. once Cgroup g=(j/q) is complete for each binary vector v of length q-1 V’=decode(v) for each i such that i < q-1 R(i, g, v) is set to the index k in Rgroup g such that S(i,k-1) + V’[k] is maximized. O(qn 2 q-1 ) Total for pre-processing for all Cgroups is O(qn*2 q-1 ) * O(n/q)=O(n 2 *2 q-1 ) time

42 RNA folding algorithm with Four-Russians Speedup

43 RNA folding algorithm with Four-Russians Speedup for j= 2 to n for i= 1 to j-1 S(i,j) = max( S(i+1,j-1)+B(i,j), S(i,j-1)) for i = j-1 to 1 for g=(j-1)/q to (i+1)/q { Rule loop d} get vg given Rgroup g retrieve k*(i,g,j) from R(i,g,v g ) S(i,j)=max( S(i,j), S(i,k*-1)+S(k*,j) ) once Cgroup g=(j/q) is complete for each binary vector v of length q-1 V’=decode(v) for each i such that i < q-1 R(i, g, v) is set to the index k in Rgroup g such that S(i,k-1) + V’[k] is maximized. O(n 2 *2 q-1 ) O(n 3 /q)

44 Run-Time Setting q= log(n) the total run-time of O(n 2 *2 q-1 ) + O(n 3 /q)+ O(n 2 q) simplifies to O(n 3 /log(n)) time total. Pre-processing Computation

45 Empirical Results - We compare our Four-Russians algorithm to the original O( n 3 )-time algorithm. -The empirical results shown in Table 1 give the average time for 50 tests of randomly generated RNA sequences and 25 downloaded sequences from genBank, for each size between 1,000bp and 6,000bp. -The algorithm performs identically for randomly generated and gen-Bank sequences of equal length. -This is to be expected because the algorithm's run time is sequence character independent.

46 * *seconds

47 Variations on scoring scheme B The Four Russians Speed-Up could be extended to any B(i,j) for which no differences between S(i,j) and S(i+1,j) depend on n. Let C denote the size of the set of all possible differences. Then asymptotic time is: whenand

48 Parallel Computing When running this algorithm in parallel on n processors we can reduce the total computation to O(log(n)* n 2 ). (Could be reduced to O( n 2 ) time by a re-ordering method; space trade off)

49 Conclusion - A practical and easy to implement Four-Russians Speed Up algorithm for prediction of RNA secondary structure, could be applied to a wide variety of scoring schemes. -The asymptotic time of O( n 3 /log(n)) is achieved by interleaving computation and preprocessing. -This time can be further lowered to O(log(n)* n 2 ) when using n processors in parallel.