1 Pattern Discovery in RNA Secondary Structure Using Affix Trees (when computer scientists meet real molecules)
Giulio Pavesi & Giancarlo Mauri
Dept. of Computer Science, Systems and Communication, University of Milano-Bicocca, Milan, Italy

2 CPM 2003 - Morelia, Giulio Pavesi
Why Is RNA So Interesting?
- After the completion of various genome projects, the attention of many researchers has shifted from coding to non-coding parts
- More than 95% of our genome is not coding: what about the rest?
- Non-coding RNA: RNA that is transcribed from DNA, but does not directly encode a protein (tRNA, microRNA, etc.)

3 A Motivating Example
- Post-transcriptional regulation of gene expression

4 The Problem
- Functionally related RNA sequences present structural similarity, at least in some parts
- Given two or more RNA molecules, find similar (supposedly functional) structural elements in them
- Sequence similarity usually implies structure similarity, but for RNA this does not always hold
- Given two or more RNA sequences of unknown structure, find similar structural elements (motifs) in them
- Low sequence similarity can still correspond to high structure similarity

5 "Know Thine Enemy"
RNA secondary structure: the list of base pairs among the nucleotides of a sequence, such that:
- No nucleotide takes part in more than one base pair (pairs are usually canonical, i.e. Watson-Crick pairs and G - T wobble pairs; G - U in the RNA alphabet)
- Base pairs never cross: if nucleotide i is bound to nucleotide j and k to l (with i < j, k < l and i < k), then either i < j < k < l or i < k < l < j
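
The two conditions above can be checked mechanically. The following is a minimal sketch, not taken from the slides: the function name `is_valid_structure`, the 0-based `(i, j)` pair representation, and the use of the T alphabet (as in the slides' example sequences) are all assumptions made for illustration.

```python
# Canonical pairs in the T alphabet used by the slides:
# Watson-Crick pairs plus the G-T wobble pair.
CANONICAL = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G"), ("G", "T"), ("T", "G")}


def is_valid_structure(seq, pairs):
    """Return True if `pairs` is a valid secondary structure for `seq`.

    `pairs` is a list of 0-based (i, j) tuples with i < j (hypothetical format).
    """
    used = set()
    for i, j in pairs:
        if not (0 <= i < j < len(seq)):
            return False
        # Condition 1: no nucleotide takes part in more than one base pair.
        if i in used or j in used:
            return False
        used.update((i, j))
        if (seq[i], seq[j]) not in CANONICAL:
            return False
    # Condition 2: base pairs never cross (i < k < j < l is forbidden).
    for i, j in pairs:
        for k, l in pairs:
            if i < k < j < l:
                return False
    return True
```

The crossing check is quadratic in the number of pairs, which is fine for illustration but not for large inputs.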

6 RNA Secondary Structure
.((..(((.((....)))))...(((.(((...)))...)))))
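
A bracket string like the one on this slide can be turned back into a list of base pairs with a stack. This is a small sketch under the usual reading of the notation (a dot is an unpaired nucleotide, matching parentheses are a base pair); the names `parse_dot_bracket` and `SLIDE_STRUCTURE` are illustrative only.

```python
def parse_dot_bracket(structure):
    """Return the sorted list of 0-based (i, j) base pairs of a dot-bracket string."""
    stack = []
    pairs = []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {j}")
            pairs.append((stack.pop(), j))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {j}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


# The structure shown on the slide.
SLIDE_STRUCTURE = ".((..(((.((....)))))...(((.(((...)))...)))))"
slide_pairs = parse_dot_bracket(SLIDE_STRUCTURE)
```

Because parentheses are matched with a stack, any string that parses is automatically free of crossing pairs.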

7 Motifs in RNA Secondary Structure
- Many functional motifs can be described by secondary structure alone
- Two types of similarity:
  - sequence similarity (mainly in unpaired nucleotides)
  - structure similarity

8 Data Structures?
- When dealing with DNA or protein sequences, significant advantages have been obtained by using suitable text-indexing structures (e.g. suffix trees)
- RNA secondary structure can be described by a string
- Is there a "good" structure that will do for RNA sequences, allowing us to consider sequence and structure at the same time?

9 Affix Trees
[Figure: affix tree for string S = ATATC]
- Suffix and prefix edges
- Suffix edges spell the substrings of string S
- Prefix edges (dotted) spell the substrings of S^-1 (the reverse of S)
- Built in linear time
- Takes linear space

10 Affix Trees
- The affix tree of a string S indexes all the substrings of both S and S^-1
- Once a substring of S has been located in the tree, we can extend it to the right (by following suffix edges) and to the left (by following prefix edges)
- Good if we search for patterns in the sequences with some kind of symmetry
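
To make the "extend to the right, extend to the left" operation concrete, here is a deliberately naive stand-in. It does not build an affix tree: it keeps an explicit list of occurrences and rescans it on every extension, so it has none of the tree's efficiency. The class name `BidirectionalMatcher` and its methods are hypothetical.

```python
class BidirectionalMatcher:
    """Naive illustration of bidirectional pattern extension over one text.

    An occurrence is a half-open span (start, end) into the text.
    """

    def __init__(self, text):
        self.text = text

    def start(self):
        # The empty pattern occurs at every position.
        return [(i, i) for i in range(len(self.text) + 1)]

    def extend_right(self, occurrences, symbol):
        # Plays the role of following a suffix edge labelled `symbol`.
        t = self.text
        return [(s, e + 1) for s, e in occurrences if e < len(t) and t[e] == symbol]

    def extend_left(self, occurrences, symbol):
        # Plays the role of following a prefix edge labelled `symbol`.
        t = self.text
        return [(s - 1, e) for s, e in occurrences if s > 0 and t[s - 1] == symbol]


# Grow the pattern "T" -> "AT" -> "ATC" inside S = ATATC, alternating directions.
matcher = BidirectionalMatcher("ATATC")
occ = matcher.extend_right(matcher.start(), "T")
occ = matcher.extend_left(occ, "A")
occ = matcher.extend_right(occ, "C")
```

The point of the sketch is only the interface: a pattern can grow on either side without restarting the search, which is what makes symmetric patterns convenient to enumerate.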

11 The Hairpin
- The basic element of RNA secondary structure is the hairpin (or stem-loop) structure
- The hairpin is symmetric!
((((( ...... )))))
AGGTC CAGTCA GATCT

12 First Try
- Predict the secondary structure of each input sequence
- Build the affix tree for the folded sequences (in bracket notation)
- Search exhaustively for patterns describing hairpin structures (possibly with differences)
- Report those occurring in at least q sequences

13 Searching for Hairpins in Affix Trees
For each loop size l:
1. Find l dots in the tree, on suffix edges (hairpin loop)
2. Add a base pair:
   a) Find a ) on suffix edges
   b) Find a ( on prefix edges
3. If the result appears in at least q sequences, jump to 2, else return from jump
4. Add internal loops:
   a) Find a dot on prefix edges: jump to 2
   b) Find a dot on suffix edges: jump to 2

14 Recursive Algorithm
Example trace (step, pattern, result of the quorum check):
1   ....         (ok)
2   (....)       (ok)
2   ((....))     (ok)
2   (((....)))   (no)
4a  .((....))    (ok)
2   (.((....)))  (ok)
- On each path, we keep one pointer for the prefix edge and another for the suffix edge
- Speed-up: represent each unpaired element with a single symbol describing its type and size, so that two symbols are compared instead of two regions
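
The recursive search can be sketched as follows. This is an illustration under several assumptions, not the authors' implementation: it scans the bracket strings directly instead of walking an affix tree, it only adds unpaired dots one at a time (each followed by a base pair, as in the "jump to 2" steps), and all names (`find_hairpins`, `extend`, `support`) are hypothetical.

```python
def extend(structures, occ, left="", right=""):
    """Keep the occurrences (k, start, end) that can grow by `left` / `right` (one symbol each)."""
    out = []
    for k, s, e in occ:
        t = structures[k]
        if left and (s == 0 or t[s - 1] != left):
            continue
        if right and (e == len(t) or t[e] != right):
            continue
        out.append((k, s - len(left), e + len(right)))
    return out


def support(occ):
    """Number of distinct structures in which the pattern occurs."""
    return len({k for k, _, _ in occ})


def find_hairpins(structures, q, loop_sizes=range(3, 9)):
    """Map each hairpin pattern found in at least q structures to its support."""
    results = {}

    def add_pair(pattern, occ):
        # Step 2: a '(' on the left and a ')' on the right.
        occ = extend(structures, occ, "(", ")")
        pattern = "(" + pattern + ")"
        # Step 3: quorum check, otherwise return from the jump.
        if support(occ) < q:
            return
        results[pattern] = support(occ)
        add_pair(pattern, occ)
        # Step 4: one unpaired dot on either side, then back to step 2.
        for left, right in ((".", ""), ("", ".")):
            grown = extend(structures, occ, left, right)
            if support(grown) >= q:
                add_pair(left + pattern + right, grown)

    for size in loop_sizes:
        loop = "." * size
        # Step 1: every run of `size` dots is a candidate hairpin loop.
        occ = [(k, s, s + size)
               for k, t in enumerate(structures)
               for s in range(len(t) - size + 1)
               if t[s:s + size] == loop]
        if support(occ) >= q:
            add_pair(loop, occ)
    return results
```

On the three toy structures `"..((....)).."`, `"(((....)))"` and `".(.((....)))"`, a quorum of 3 keeps only `(....)` and `((....))`, mirroring the first lines of the trace; lowering the quorum to 1 also reaches `(.((....)))`.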

15 Approximate Search
We can allow some approximation:
- Hairpin loops of different size (range value at step 1)
- Internal loops of different size at the same position along the stem
- Internal loops or bulges at different positions along the stem
- Stems of different size (base pairs)
- Any combination of the previous

16 Complexity
Given a set of k folded sequences of overall length N:
- Construction of the tree: O(N)
- Annotation of the tree: O(kN)
- Search: O(V(m)kN), where m is the length of the longest pattern found
- V(m) depends on the degree of approximation
- In practice, the most time-consuming part is predicting the structure of the sequences

17 Does It Work?
- Test: Iron Responsive Element, located in the UTRs of mRNAs coding for proteins involved in iron metabolism (e.g. ferritin, transferrin)
- Does it appear in all the predicted structures?
- Alas, it does not!

18 Why?
- The "real structure" often does not correspond to the optimal one!
- The motif "disappears" from the (supposedly) optimal structure

19 One Possible Solution
- Idea: for each sequence, also consider a number of alternative sub-optimal structures
- All the possible structures can be enumerated
- Check whether a motif appears in at least one alternative structure per sequence
- The affix tree can efficiently handle even hundreds of alternative structures per input sequence
- Downside: the number of potential secondary structures for a sequence of length n is O(2^n)
- If similarity is not stringent, we have too many candidates
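
To get a feel for how quickly the number of alternative structures grows, they can be counted with a standard dynamic program over subsequences. This sketch is an addition for illustration and does not reproduce the bound quoted on the slide; the minimum hairpin loop size of 3, the T alphabet, and the name `count_structures` are assumptions.

```python
from functools import lru_cache

PAIRS = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G"), ("G", "T"), ("T", "G")}


def count_structures(seq, min_loop=3):
    """Count all non-crossing canonical pairings of `seq` (the empty one included)."""

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of structures on seq[i..j], both ends inclusive.
        if j - i < 1:
            return 1  # empty or single nucleotide: only the unpaired structure
        total = count(i, j - 1)  # case 1: nucleotide j is unpaired
        # Case 2: j pairs with some k, leaving at least `min_loop` bases in between.
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in PAIRS:
                total += count(i, k - 1) * count(k + 1, j - 1)
        return total

    return count(0, len(seq) - 1)


# The hairpin example from the slides: counts every structure it could fold into.
n_structures = count_structures("AGGTCCAGTCAGATCT")
```

The recursion depth grows with the sequence length, so for long inputs an iterative table would be the safer choice.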

20 But...
- If the same structure has to appear in a set of sequences, then the same pattern of complementary base pairs has to appear in the sequences
((((( ...... )))))
AGGTC CAGTCA GATCT
GCGAG CAGTCT CTTGC
CCCAG CAGTCA CTGGG

21 Idea!
- Instead of working on folded sequences, build the affix tree for the sequences alone, and find complementary base pairs on the fly
- The search can be implemented with the same parameters as in the folded case
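
Pairing "on the fly" can be illustrated on the three example sequences from the previous slide. The sketch below is a simplification made for illustration: it scans each sequence directly rather than using an affix tree, it only looks for perfect stems around a loop of fixed size, and the names `hairpins` and `common_hairpins` are hypothetical.

```python
PAIRS = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G"), ("G", "T"), ("T", "G")}


def hairpins(seq, loop_size, min_stem):
    """Return (start, stem_length) for each maximal perfect stem closing a loop of `loop_size`."""
    found = []
    n = len(seq)
    for loop_start in range(1, n - loop_size):
        i, j = loop_start - 1, loop_start + loop_size
        stem = 0
        # Grow the stem outwards while the two ends are complementary.
        while i >= 0 and j < n and (seq[i], seq[j]) in PAIRS:
            stem += 1
            i -= 1
            j += 1
        if stem >= min_stem:
            found.append((i + 1, stem))
    return found


def common_hairpins(seqs, q, loop_size, min_stem):
    """Per-sequence hits if at least q sequences contain such a hairpin, else an empty list."""
    hits = [hairpins(s, loop_size, min_stem) for s in seqs]
    return hits if sum(1 for h in hits if h) >= q else []


# The three sequences shown on the slides, with the spaces removed.
SLIDE_SEQS = ["AGGTCCAGTCAGATCT", "GCGAGCAGTCTCTTGC", "CCCAGCAGTCACTGGG"]
shared = common_hairpins(SLIDE_SEQS, q=3, loop_size=6, min_stem=5)
```

All three sequences yield the same hairpin shape (five base pairs around a six-nucleotide loop) even though their stems differ in sequence, which is the observation this approach relies on.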

22 Building Hairpins on the Fly
- By working on unfolded sequences, the theoretical time complexity is higher, since different paths correspond to the same structure
- In practice it is much faster, since we do not have to run the prediction algorithm on the input sequences
- We need to "validate" the candidate structures, e.g. according to their energy

23 Post-Processing
- So far we have considered structure alone
- More than a single motif occurrence per sequence is often reported, especially if structural constraints are loose
- Post-processing: compare the candidate occurrences by evaluating sequence similarity in unpaired elements
- Find the group of instances that are most similar at the sequence level
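
One simple way to realize this post-processing step is to pick one candidate per sequence so that the chosen unpaired regions are as similar as possible. The sketch below is an assumption-laden illustration: the slides do not say which similarity measure is used, so Hamming distance on equal-length loops is a stand-in, the exhaustive search is only feasible for a handful of candidates, and `most_similar_group` is a hypothetical name.

```python
from itertools import combinations, product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("loops must have the same length")
    return sum(x != y for x, y in zip(a, b))


def most_similar_group(candidates):
    """Choose one loop per sequence, minimizing the sum of pairwise Hamming distances.

    `candidates[k]` lists the unpaired loop sequences reported for sequence k.
    Returns (chosen_loops, total_distance).
    """
    best, best_cost = None, None
    for group in product(*candidates):
        cost = sum(hamming(a, b) for a, b in combinations(group, 2))
        if best_cost is None or cost < best_cost:
            best, best_cost = group, cost
    return best, best_cost


# Loop sequences of the three hairpins shown on the slides, one candidate each.
group, cost = most_similar_group([["CAGTCA"], ["CAGTCT"], ["CAGTCA"]])
```

With several candidates per sequence the number of combinations grows multiplicatively, so a greedy or clustering-based selection would be needed in practice.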

24 Results and Work in Progress
- The second approach gave better results, in terms of reliability and efficiency
- Candidate hairpins can be validated according to their energy value (more reliable, in this case!)
- Good results on "harder" tests
- Still too many input parameters
- Extend to more complex structures

