
Local Exact Pattern Matching for Non-fixed RNA Structures Mika Amit, Rolf Backofen, Steffen Heyne, Gad M. Landau, Mathias Mohl, Christina Schmiedl, Sebastian.


1 Local Exact Pattern Matching for Non-fixed RNA Structures Mika Amit, Rolf Backofen, Steffen Heyne, Gad M. Landau, Mathias Mohl, Christina Schmiedl, Sebastian Will

2 RNA (CPM 2012, Helsinki): An RNA R is an ordered pair (S,B), where S is a sequence defined over Σ = {A,C,G,U} and B is a set of base pairs, each of which is C-G, G-C, A-U, or U-A. Figure: an example RNA (CUCGUCAGUACGACU) with its single bases, base pairs, and backbone connections labeled.

3 RNA: An RNA R is an ordered pair (S,B), where S represents the primary structure of R and B represents the secondary structure of R.
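The definition of R = (S,B) can be sketched as a small data type. This is only an illustration: the class name `RNA`, the 0-based `(i, j)` index convention, and the example hairpin are assumptions, not taken from the slides.

```python
from dataclasses import dataclass

ALPHABET = set("ACGU")
VALID_PAIRS = {("C", "G"), ("G", "C"), ("A", "U"), ("U", "A")}


@dataclass(frozen=True)
class RNA:
    """Hypothetical container for R = (S, B)."""
    seq: str          # S: the primary structure
    pairs: frozenset  # B: base pairs as 0-based (i, j) with i < j (assumed convention)

    def __post_init__(self):
        if not set(self.seq) <= ALPHABET:
            raise ValueError("sequence must be over {A, C, G, U}")
        for i, j in self.pairs:
            if not (0 <= i < j < len(self.seq)):
                raise ValueError(f"bad base pair indices: {(i, j)}")
            if (self.seq[i], self.seq[j]) not in VALID_PAIRS:
                raise ValueError(f"not a C-G, G-C, A-U or U-A pair: {(i, j)}")


# Made-up example: a small hairpin with two stacked G-C pairs.
hairpin = RNA("GGAAACC", frozenset({(0, 6), (1, 5)}))
```

Validating in `__post_init__` keeps every constructed object consistent with the definition, so later code can rely on it.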

4 RNA Representations: An RNA can be represented as an arc-annotated string or as a tree. Figure: the example RNA (CUCGUCAGUACGACU) shown in both representations.

5 RNA Secondary Structure: The secondary structure determines the activity and functionality of the RNA. It is heavily researched and is usually better conserved during evolution than the primary sequence.

6 RNA Structure: Predicting the secondary structure of an RNA molecule is a difficult task. The structure is sometimes given in a non-fixed form, where each base pair has a probability ≤ 1 of existing in the RNA.

7 Nested Structure: In all of these examples, the structure of R is Nested: each base can be bonded to at most one other base, and there are no crossing arcs.
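The two conditions of a Nested structure can be checked directly. The sketch below assumes base pairs are given as 0-based `(i, j)` index tuples; the function name `is_nested` is hypothetical.

```python
def is_nested(pairs):
    """True if every base is in at most one pair and no two arcs cross."""
    # Condition 1: each base takes part in at most one base pair.
    seen = set()
    for i, j in pairs:
        if i == j or i in seen or j in seen:
            return False
        seen.update((i, j))

    # Condition 2: no crossing arcs. Scan arcs by left endpoint and keep
    # the right endpoints of the currently open arcs on a stack.
    open_right_ends = []
    for i, j in sorted((min(p), max(p)) for p in pairs):
        while open_right_ends and open_right_ends[-1] < i:
            open_right_ends.pop()      # that arc closed before i
        if open_right_ends and open_right_ends[-1] < j:
            return False               # new arc ends after an enclosing arc: crossing
        open_right_ends.append(j)
    return True
```

For example, `is_nested([(0, 6), (1, 5)])` holds, while `[(0, 3), (2, 5)]` fails because the arcs cross.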

8 Unlimited Structure: Arc-annotated strings can represent Unlimited structures as well. Figure: the arc-annotated example string CUACCGAGUCAGUACGACGCAUUAC.

9 Bounded-Unlimited Structure: Arc-annotated strings can represent Bounded-Unlimited structures: each base can be connected to a constant number of other bases, and crossing arcs are allowed. Figure: the arc-annotated example string CUACCGAGUCAGUACGACGCAUUAC.
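A Bounded-Unlimited structure only limits how many arcs touch each base, so checking it reduces to a degree count. In this sketch the bound `k` stands in for the "constant number" on the slide, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def max_degree(pairs):
    """Largest number of arcs incident to any single base."""
    degree = Counter()
    for i, j in pairs:
        degree[i] += 1
        degree[j] += 1
    return max(degree.values(), default=0)


def has_crossing(pairs):
    """True if some two arcs cross (quadratic check, fine for small inputs)."""
    arcs = [(min(p), max(p)) for p in pairs]
    return any(a < c < b < d or c < a < d < b
               for (a, b), (c, d) in combinations(arcs, 2))


def is_bounded_unlimited(pairs, k):
    """Crossing arcs are allowed; only the per-base degree is bounded by k."""
    return max_degree(pairs) <= k
```

With `pairs = [(0, 3), (2, 5), (2, 7)]`, base 2 has degree 2 and the first two arcs cross, so the structure is Bounded-Unlimited for `k = 2` but not Nested.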

10 RNA Similarity Algorithms: Many algorithms for finding similarity between RNA molecules use tree similarity algorithms. Tree Edit Distance: Tai (’79), O(n^6); Zhang & Shasha (’89), O(n^4); Klein (’98), O(n^3 log n); Ma et al. (’99), O(n^3 log n); Demaine et al. (’07), O(n^3).

11 RNA Similarity Algorithms: Many algorithms for finding similarity between RNA molecules use tree similarity algorithms. Tree Alignment: Jiang et al. (’95); Schirmer & Giegerich (’11); Backofen et al. (’07); Mohl et al. (’09).

12 RNA Similarity Algorithms: Many algorithms for finding similarity between RNA molecules use tree similarity algorithms. Longest Arc-Preserving Common Subsequence: Evans (’99); Lin et al. (’02); Alber et al. (’04); Jiang et al. (’04).

13 RNA Similarity Algorithms: Many algorithms for finding similarity between RNA molecules use tree similarity algorithms. Similar Subforests: Jansson & Peng (’11).

14 Exact Pattern Matching Problem: In this work, we search for local common sequence-structure regions (patterns) between two given RNA molecules.

15 Patterns in RNAs: In this work, we search for local common sequence-structure regions (patterns) between two given RNA molecules.

16 Exact Pattern Matching Problem: Find all maximal common sequence-structure regions between two RNAs. Solved by Backofen & Siebert in O(n^2) for fixed Nested x Nested structures. Figure: two example RNAs (UCUACUCAGCGUACG and UCAAGUCAGAGAACCCG) with a legend: single base match, left endpoint match, type mismatch.

17 Exact Pattern Matching Problem: In this work, we solve the problem for non-fixed Nested x Nested structures. Figure: two example RNAs (UCUACUCAGCGUACG and UCAAGUCAGAGAACCCG) with arc breaking marked.

18 Arc Breaking Operation: We support the arc-breaking operation, in which a base pair can be deleted with no penalty. Figure: the example sequence GCCCGCUAAGAGGUUGAC with single bases and a base pair labeled.

19 Arc Breaking Operation: We support the arc-breaking operation, in which a base pair can be deleted with no penalty. Figure: the example sequence GCCCGCUAAGAGGUUGAC with single bases and a base pair labeled.

20 Arc Breaking: We support the arc-breaking operation, in which a base pair can be deleted with no penalty. Figure: arc breaking illustrated on the tree representation of two RNAs.
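As a minimal sketch of arc breaking: deleting a base pair removes the arc from B but leaves both bases in S, where they then act as single bases. The function name `break_arc` and the tuple representation are assumptions made for illustration.

```python
def break_arc(seq, pairs, arc):
    """Return (seq, pairs without arc); the sequence itself is unchanged."""
    if arc not in pairs:
        raise KeyError(f"{arc} is not a base pair of this structure")
    return seq, frozenset(pairs) - {arc}


# Breaking the inner arc of a made-up hairpin keeps all seven bases.
seq, pairs = break_arc("GGAAACC", {(0, 6), (1, 5)}, (1, 5))
```

Because the operation carries no penalty, a match may treat any base pair endpoint as an ordinary single base.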

21 Arc Breaking: Patterns are now less restrictive.

22 Exact Pattern Matching Algorithms: We describe three algorithms for finding the local exact pattern matching between two RNAs: a simple O(n^4) algorithm (using ideas from Zhang & Shasha (’89)); an improved O(n^3 log n) algorithm (using ideas from Klein (’98)); and an O(n^3) algorithm (using ideas from Demaine, Weimann et al. (’07)).

23 Exact Pattern Matching Algorithm: Input: R1=(S1,B1) and R2=(S2,B2), with |R1|=n, |R2|=m, n>m. Output: the local exact pattern matching between R1 and R2.

24 Exact Pattern Matching Algorithm: We compare each base pair from R1 with each base pair from R2, in increasing order of their sizes.
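The comparison order can be sketched as a sorted double loop. Here the size of a base pair is assumed to be its span `j - i`, and the function names are hypothetical.

```python
def by_size(pairs):
    """Base pairs sorted by span (j - i), ties broken by position."""
    return sorted(pairs, key=lambda p: (p[1] - p[0], p))


def comparison_order(pairs1, pairs2):
    """All (bp1, bp2) combinations, smaller base pairs first."""
    return [(p, q) for p in by_size(pairs1) for q in by_size(pairs2)]


order = comparison_order([(0, 6), (1, 5)], [(2, 4), (0, 8)])
```

A base pair nested inside another always has a smaller span, so with this order every combination of inner base pairs is visited before the combination of the pairs that enclose them.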

25 Exact Pattern Matching Algorithm: For each two base pairs, we compute the matching inside the base pairs and the extensions to their outsides.

26 Matching Inside the Base Pairs: A dynamic programming algorithm, similar to the LCS and edit distance algorithms for strings.

27 Matching Inside the Base Pairs: On each comparison we compute only prefixes of the substrings and select the maximal score over 4 expressions. Match base pairs: is S1(i)==S2(j)? Figure: positions i and j inside the base pairs bp1 and bp2.

28 Matching Inside the Base Pairs: Match single bases: is S1(i)==S2(j)? Figure: positions i and j inside the base pairs bp1 and bp2.

29 Matching Inside the Base Pairs: Delete from R1 (continue from i-1). Delete from R2. Figure: positions i and j inside the base pairs bp1 and bp2.
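The four expressions (match base pairs, match single bases, delete from R1, delete from R2) can be illustrated on a simplified problem. The sketch below is not the authors' local algorithm: it computes a global common-substructure score for two Nested structures with free arc breaking, under an assumed scoring of +1 per matched base and +1 per matched base pair. The function name and scoring are hypothetical.

```python
from functools import lru_cache


def max_common_score(s1, pairs1, s2, pairs2):
    """Simplified global score; pairs are 0-based (i, j), i < j, nested."""
    left1 = {j: i for i, j in pairs1}   # right endpoint -> left endpoint
    left2 = {j: i for i, j in pairs2}

    @lru_cache(maxsize=None)
    def f(i1, j1, i2, j2):
        # Best score between s1[i1..j1] and s2[i2..j2] (inclusive bounds).
        if j1 < i1 or j2 < i2:
            return 0
        best = max(f(i1, j1 - 1, i2, j2),    # delete from R1
                   f(i1, j1, i2, j2 - 1))    # delete from R2
        if s1[j1] == s2[j2]:
            # Match single bases (any arc ending here is broken for free).
            best = max(best, f(i1, j1 - 1, i2, j2 - 1) + 1)
            p1, p2 = left1.get(j1), left2.get(j2)
            if (p1 is not None and p2 is not None
                    and p1 >= i1 and p2 >= i2 and s1[p1] == s2[p2]):
                # Match base pairs: two bases plus one arc, then solve the
                # part left of the arcs and the part inside them separately.
                best = max(best,
                           f(i1, p1 - 1, i2, p2 - 1)
                           + f(p1 + 1, j1 - 1, p2 + 1, j2 - 1) + 3)
        return best

    return f(0, len(s1) - 1, 0, len(s2) - 1)
```

For `"GAC"` with the base pair `(0, 2)` compared against itself, the score is 4 (three bases plus one arc); if the second structure has no base pair, the arc is broken and the score drops to 3. Each state takes the maximum over at most four expressions in O(1) time, which mirrors the shape of the slide's recurrence.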

30 Matching Inside the Base Pairs: On each comparison we compute the maximal match from left to right. Figure: two example sequences (UCGAGAUAUUAACGCC and UUCGAACAAUCUAAGUCUAG) with positions i and j marked.

31 Matching Inside the Base Pairs: On each comparison we compute the maximal match from right to left.

32 Matching Inside the Base Pairs: There are two tricky parts here. The first: what happens when a mismatch occurs?

33 Matching Inside the Base Pairs: There are two tricky parts here. The second: what happens when the matchings overlap?

34 Matching Inside the Base Pairs: The solution: on each comparison we compute the best score going both right to left and left to right.

35 Time Complexity: We only compare prefixes of the base pairs. There are O(n^2) prefixes for each RNA, and each comparison is computed in O(1) time, so the total time is O(n^4).

36 Extending the Match: We compute the maximal pattern extension for all bases in R1 and all bases in R2 in one run. The time complexity is O(n^2).
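One way to see how all extensions can be filled in a single O(n^2) pass is the classic common-extension table on plain sequences. This sketch is a sequence-only simplification that ignores base pairs and extends to the right only; the slide's extension step also handles structure. The name `extension_table` is hypothetical.

```python
def extension_table(s1, s2):
    """ext[i][j] = length of the longest common run starting at s1[i], s2[j]."""
    n, m = len(s1), len(s2)
    ext = [[0] * (m + 1) for _ in range(n + 1)]
    # Fill from the ends backwards so ext[i+1][j+1] is ready when needed.
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if s1[i] == s2[j]:
                ext[i][j] = ext[i + 1][j + 1] + 1
    return ext


ext = extension_table("UCAG", "GCAU")
```

Every cell is computed once from a single neighbour, so the whole table costs O(nm) time, which is O(n^2) when n > m.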

37 Total Time Complexity: Computing the pattern match inside all base pairs takes O(n^4). Computing the pattern match extensions to the right and to the left takes O(n^2). The total time complexity is O(n^4) + O(n^2) = O(n^4).

38 An O(n^3 log n) Algorithm: We use ideas from Klein’s tree edit distance algorithm (’98): we decompose the larger RNA into heavy paths. The root base pair is marked light, and we continue recursively: select the maximal child base pair and mark it as heavy, and mark the rest of the children as light.
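The heavy/light marking can be sketched on the nesting tree of a Nested structure. Two assumptions are made here: the "maximal" child is the one with the largest span `j - i`, and every top-level base pair counts as a root. The function name is hypothetical.

```python
def heavy_light(pairs):
    """Map each base pair (i, j) of a nested structure to 'heavy' or 'light'."""
    # Build the nesting tree: sorted by left endpoint, parents come first.
    children = {p: [] for p in pairs}
    roots, enclosing = [], []
    for p in sorted(pairs):
        while enclosing and enclosing[-1][1] < p[0]:
            enclosing.pop()                 # that pair closed before p starts
        (children[enclosing[-1]] if enclosing else roots).append(p)
        enclosing.append(p)

    # Roots are light; under each pair the largest child is heavy.
    label = {}
    todo = [(r, "light") for r in roots]
    while todo:
        p, mark = todo.pop()
        label[p] = mark
        kids = children[p]
        if kids:
            heavy = max(kids, key=lambda c: c[1] - c[0])
            todo.extend((c, "heavy" if c == heavy else "light") for c in kids)
    return label


labels = heavy_light([(0, 11), (1, 4), (5, 10), (6, 9)])
```

In the example, `(0, 11)` is the light root, `(5, 10)` is its heavy child, `(1, 4)` is light, and `(6, 9)` continues the heavy path as the only child of `(5, 10)`.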

39 Special Substrings: For each base pair we define its special substrings. The number of special substrings of a base pair is |bp| - |hp| + 1. Lemma (Sleator & Tarjan ’83): there are O(n log n) special substrings in an RNA R of size n.
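The lemma can be connected to the per-base-pair count with a short counting argument. This is a sketch under an assumed reading of the notation: |bp| is the number of bases spanned by a base pair, |hp| is the number of bases spanned by its heavy child (0 if it has none), and hp(bp) denotes that heavy child.

```latex
\sum_{bp \in B} \bigl(|bp| - |hp| + 1\bigr)
  \;=\; |B| + \sum_{x=1}^{n} \#\bigl\{\, bp \in B : x \in bp,\ x \notin hp(bp) \,\bigr\}
  \;\le\; \frac{n}{2} + n\,(\log_2 n + 1)
  \;=\; O(n \log n)
```

The middle step counts each base x once for every enclosing base pair whose heavy child does not contain x. That happens for the innermost base pair around x, and otherwise only when x lies in a light child. A light child spans fewer than half the bases of its parent, so x can sit under at most log_2 n light children. A Nested structure has at most n/2 base pairs, which bounds |B|.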

40 An O(n^3 log n) Algorithm: We compare all O(n^2) substrings of R2 with the O(n log n) special substrings of R1.

41 An O(n^3 log n) Algorithm: The comparisons are made between the rightmost or leftmost bases, according to the special substring.

42 An O(n^3 log n) Algorithm: The total number of compared substrings is O(n^3 log n), each computed in O(1) time, which gives a total running time of O(n^3 log n). This algorithm also works for Nested x Bounded-Unlimited structures.

43 An O(n^3) Algorithm: Based on the algorithm of Demaine et al. (’07), we decompose both RNAs into heavy paths. The special substrings are decided at each base pair comparison: the base pair that has the largest root light base pair is the dominant one. Figure: example RNAs R1 and R2 with their base pairs labeled.

44 An O(n^3) Algorithm: The number of compared substrings is O(n^3). This algorithm works with Nested x Nested structures only.

45 More Algorithms: Find the local approximate pattern matching between Nested x Nested structures in O(n^3 k^2) for k allowed mismatches. Find the local approximate pattern matching between Nested x Bounded-Unlimited structures in O(n^3 k^2 log n) for k allowed mismatches. Find the most similar sibling substructures between Nested x Nested structures in O(n^3).

46 THANK YOU!

