MicroRNA The Computational Challenge Bioinformatics Seminar, March 9, 2005 By Yaron Levy.


1 MicroRNA The Computational Challenge Bioinformatics Seminar, March 9, 2005 By Yaron Levy

2 Tree of RNA Types

3 miRNA Biological Process

4 Micro RNA – Computational Approach Problem 1: Finding putative microRNA from a sequence – Horesh et al, using the suffix tree data structure Problem 2: Computing secondary structure of a given sequence – Zuker & Stiegler, minimum free energy, using dynamic programming Problem 3: miRNA prediction algorithms – Lim et al, MiRscan Problem 4: Predicting miRNA target genes – Lewis et al, TargetScan

5 Problem 1 Find these

6 Problem 1: Finding putative microRNA from a sequence A naïve idea: slide a “window” of size L over the sequence of size N, looking for stems of size S. – Computationally O(NL+NS) – too much A better approach – using a suffix tree.

7 What is a suffix tree? S = MALAYALAM$ (positions 1 to 10) [Figure: suffix tree of S; each leaf is labeled with the start position of its suffix]

8 Suffix tree properties For a string S of length n, there are n leaves and at most n internal nodes. – therefore requires only linear space Each leaf represents a unique suffix. Concatenation of edge labels from root to a leaf spells out the suffix. Each internal node represents a distinct common prefix to at least two suffixes.

9 Finding a (short) Pattern in a (long) String 1. Build a suffix tree of the string. 2. Starting from the root, traverse a path matching characters of the pattern. 3. If stuck, pattern not present in string. Otherwise, each leaf below gives a position of the pattern in the string.
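
The three steps above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it builds an uncompressed suffix trie (quadratic space) rather than a true linear-space suffix tree, and the names build_suffix_trie and find_pattern are hypothetical, not taken from the talk.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie.

    Children are keyed by single characters; the three-letter key "pos"
    cannot collide with them and stores the 1-based start of the suffix.
    """
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["pos"] = start + 1
    return root


def find_pattern(root, pattern):
    """Walk down from the root along pattern, then collect all leaves below."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []  # stuck: the pattern does not occur in the string
        node = node[ch]
    positions = []
    stack = [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "pos":
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)


trie = build_suffix_trie("MALAYALAM$")
print(find_pattern(trie, "ALA"))  # [2, 6]
```

The walk along the pattern costs time proportional to the pattern length; collecting the leaves costs time proportional to the size of the subtree below the match.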

10 Finding a Pattern in a String Find “ALA” in MALAYALAM$: two matches, at positions 6 and 2 [Figure: the suffix tree of MALAYALAM$ with the path spelling ALA highlighted; the two leaves below it are 6 and 2]

11 Generalized Suffix Tree [Figure: generalized suffix tree of string 1 = WINDOW$ and string 2 = INDIGO$ (positions 1 to 7 each); each leaf is labeled (string number, start position)]

12 Horesh et al – using a generalized suffix tree for finding putative microRNAs Assumptions: – At least a triple repeat is necessary: 2 repeats form the stems of the hairpin; they are close to each other in the sequence and are inverted repeats of each other. The remaining repeats are target genes and can be anywhere. – The repeats must be fully matched; no mismatches are allowed. This is more of a constraint than an assumption.

13 Horesh et al – the algorithm Construct a generalized suffix tree of the original sequence and the inverted repeat sequence. Preprocess the suffix tree for calculating: – Length of suffixes – Number of repeats – Index of suffix in sequence With these attributes for each node, along with the indices of the suffixes in the sequence, it is possible to find the requested triple (or more) repeats. – Computationally efficient O(N)

14 1. Build a suffix tree. 2. Scan the tree in a pre-order traversal (every parent is visited before its children). The length of the prefix a node represents is: node.len = parent.len + length of the sequence fragment the node carries (the root has length 0). [Figure: suffix tree with edge labels “a”, “banana”, “na”; each node is annotated with its prefix length]

15 3. Scan the tree in a post-order traversal (all children are visited before their parent). The number of repeats of the prefix a node represents is: node.repeats = sum of its children's repeats (a leaf has 1). [Figure: the same suffix tree with each node annotated with its repeat count; the root has 6]

16 Now every node carries the length of the prefix it represents and the number of leaves below it (the number of times that prefix is repeated). 4. Scan the tree again. For every node that represents a prefix longer than SIZE (22, for example) and has two or more repeats, print its length, its number of repeats, and the indices of its leaves. All steps run in linear time! [Figure: the annotated suffix tree with the qualifying nodes marked]
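
The four steps can be sketched as follows. This is a minimal illustration, not the implementation of Horesh et al: it uses an uncompressed suffix trie instead of a generalized suffix tree (so it is not linear time), it assumes the input ends with a unique terminator such as $, and it reports prefixes of length at least min_len. The names Node, count_repeats and report_repeats are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.length = 0       # length of the prefix this node represents
        self.repeats = 0      # number of leaves (suffixes) below this node
        self.position = None  # 1-based suffix start, set on leaves only


def build_suffix_trie(text):
    """Steps 1 and 2: build the trie and set lengths top-down (parent first)."""
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            child = node.children.get(ch)
            if child is None:
                child = Node()
                child.length = node.length + 1  # parent.len + edge length (1 here)
                node.children[ch] = child
            node = child
        node.position = start + 1
    return root


def count_repeats(node):
    """Step 3: post-order pass; a leaf counts 1, an inner node sums its children."""
    if not node.children:
        node.repeats = 1
    else:
        node.repeats = sum(count_repeats(c) for c in node.children.values())
    return node.repeats


def leaf_positions(node):
    if not node.children:
        return [node.position]
    found = []
    for child in node.children.values():
        found.extend(leaf_positions(child))
    return sorted(found)


def report_repeats(text, min_len, min_repeats=2):
    """Step 4: report every sufficiently long prefix that repeats often enough."""
    root = build_suffix_trie(text)
    count_repeats(root)
    result = {}
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        if node.length >= min_len and node.repeats >= min_repeats:
            result[prefix] = leaf_positions(node)
        for ch, child in node.children.items():
            stack.append((child, prefix + ch))
    return result


print(report_repeats("banana$", 2))
```

For "banana$" with min_len=2 the repeated prefixes are "an" and "ana" (positions 2 and 4) and "na" (positions 3 and 5). For the hairpin search described earlier, the same scan would run over the sequence concatenated with its inverted repeat.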

17 Problem 1 conclusions The problem is not trivial! Suffix trees are an elegant solution, provided that: – No mismatches are allowed (not really biologically realistic) – There is enough memory to store the large data structure

18 Problem 2 How do these fold?

19 Problem 2: Computing secondary structure of a given sequence Approaches to RNA secondary structure prediction: – comparative sequence analysis – prediction from base sequence find minimum free energy (MFE) structure

20 Free energy model free energy of structure (at fixed temperature, ionic concentration) = sum of loop energies standard model uses experimentally determined thermodynamic parameters where available; extrapolations for long loops

22 On the MFE approach MFE approach ignores folding pathway, metal ions, nonstandard bonds “some species can remain kinetically trapped in nonequilibrium states… we expect that most RNA’s exist naturally in their thermodynamically most stable configurations” – Tinoco and Bustamante, J. Mol. Biol. 1999.

23 Why is MFE secondary structure prediction hard? The MFE structure can be found by calculating the free energy of all possible structures, but: the number of potential structures grows exponentially with the number, n, of bases; structures can be arbitrarily complex; success has been achieved only for restricted classes of structures
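
To make the exponential growth concrete, the sketch below counts pseudoknot-free secondary structures on n bases. It rests on assumptions that are not in the talk: any two bases may pair (base identity is ignored, so the count is an upper bound for a real sequence), and a pair must enclose at least min_loop unpaired or nested bases. The function name count_structures is hypothetical.

```python
def count_structures(n, min_loop=1):
    """Count pseudoknot-free structures on n bases.

    The last base is either unpaired, or paired with some base k; the pair
    splits the strand into an independent left part and an enclosed part.
    """
    counts = [1] * (n + 1)  # counts[0] = 1: the empty structure
    for length in range(1, n + 1):
        total = counts[length - 1]  # last base unpaired
        for k in range(1, length - min_loop):  # last base paired with base k
            total += counts[k - 1] * counts[length - k - 1]
        counts[length] = total
    return counts[n]


for n in (10, 20, 40, 80):
    print(n, count_structures(n))
```

Under these assumptions the counts for n = 1..6 are 1, 1, 2, 4, 8, 17, and they keep growing by a roughly constant factor per added base, which is why explicit enumeration is hopeless for realistic lengths.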

24 Predicting MFE pseudoknot free structures Dynamic programming avoids explicit enumeration of all pseudoknot free structures ( Zuker & Stiegler 1981 ) Suboptimal folds, probabilities of base pairings can also be calculated software: mfold, Vienna package

25 Dynamic programming (Zuker & Stiegler) Based on the “more is less” principle: by calculating more than you need, less work is needed overall Construct MFE structure for whole strand from MFE structures for substrands Running time is O(n^3)

26 RNA folding with dynamic programming Assume a function W(i,j) which is the MFE for the sequence starting at i and ending at j (i<j) Define sigma as the energy function for the simple cases, in which, for example, a base pair scores lower than a non-pair Consider 4 recursion possibilities: – i,j are a base pair, added to the structure for i+1..j-1 (define this as V(i,j)) – i is unpaired, added to the structure for i+1..j – j is unpaired, added to the structure for i..j-1 – i,j are paired, but not to each other; the structure for i..j adds together the sub-structures for 2 sub-sequences, i..k and k+1..j: a bifurcation (i<k<j)
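
The four cases can be turned into a small dynamic program. The sketch below is a deliberately simplified stand-in, not the Zuker and Stiegler energy model: sigma is assumed to be -1 for any Watson-Crick or G-U pair, unpaired bases cost nothing, and a pair must enclose at least min_loop bases. The name fold_score and these parameter values are assumptions made for illustration.

```python
# Assumed toy scoring: every allowed pair contributes -1 (lower is better).
PAIR_ENERGY = {
    ("A", "U"): -1, ("U", "A"): -1,
    ("G", "C"): -1, ("C", "G"): -1,
    ("G", "U"): -1, ("U", "G"): -1,
}


def fold_score(seq, min_loop=3):
    """Return W(0, n-1), the minimum score over pseudoknot-free structures."""
    n = len(seq)
    if n == 0:
        return 0
    W = [[0] * n for _ in range(n)]
    # Fill by increasing span so every sub-problem is solved before it is used.
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = min(W[i + 1][j], W[i][j - 1])  # i unpaired, or j unpaired
            sigma = PAIR_ENERGY.get((seq[i], seq[j]))
            if sigma is not None:  # i and j pair with each other
                best = min(best, W[i + 1][j - 1] + sigma)
            for k in range(i + 1, j):  # bifurcation into i..k and k+1..j
                best = min(best, W[i][k] + W[k + 1][j])
            W[i][j] = best
    return W[0][n - 1]


print(fold_score("GGGAAACCC"))  # -3: three stacked G-C pairs around the AAA loop
```

The three nested loops over i, j and k give the O(n^3) running time quoted earlier; a realistic implementation replaces the constant pair score with loop-dependent energies.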

27 Dynamic programming (Zuker and Stiegler) W(i,j): MFE structure of the substrand from i to j V(i,j): MFE structure of the substrand from i to j, in which the i-th and j-th bases are paired [Figure: schematic arcs illustrating W(i,j) and V(i,j)]

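28 Recurrences W(i,j) = min of: V(i,j); W(i,k) + W(k+1,j) over split points k [Figure: diagram of the W(i,j) recurrence]

29 Recurrences V(i,j) = min over the loop types closed by the pair (i,j) [Figure: diagram showing a hairpin closed by (i,j), a loop closed by (i,j) and an inner pair (k,l), and a bifurcation into i+1..k and k+1..j-1]

30 Recurrences [Figure: the W(i,j) and V(i,j) recurrence diagrams shown together]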

31 What is actually being done? Simple base pair maximization is a poor scoring scheme for RNA structure prediction. It is more plausible that an RNA adopts a globally minimum energy structure, not the structure with the maximum number of base pairs. The thermodynamic model was developed in conjunction with the DP algorithms: the independence assumptions in the thermodynamic model's terms have been made compatible with the independence assumptions needed for recursive dynamic programming algorithms to work. Energy minimization algorithms become somewhat complex, with more detailed recursions that distinguish different lengths and types of loops, and which score base pairs according to nearest-neighbor stacking interactions with adjacent base pairs. Nonetheless, the mechanics of the algorithm are pretty much the same.

32 Problem 2 conclusions RNA secondary structure finding is a hard problem – exponential number of possibilities Several heuristics claim to achieve relatively good success rates – Specifically, MFE based algorithms are believed to be ~70% accurate on structures without pseudoknots.

33 Problem 3 How to predict these?

34 Problem 3: miRNA predicting algorithms Lim et al. developed a machine learning tool called MiRscan to help identify new miRNA genes This program looks at hairpin sequences conserved between species (C. elegans and C. briggsae) The program is given a training set of known miRNAs in C. elegans This data is then used to identify which conserved hairpin sequences are most similar to the training data.

35 MiRscan Algorithm The MiRscan algorithm examines several features of the hairpin The total score is computed by summing the score of each feature The score for each feature is computed by dividing the frequency of the given value in the training set by its overall frequency
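
The scoring scheme can be sketched as below. Several things here are assumptions rather than facts from the talk: the log2 of the frequency ratio is taken so that per-feature scores add sensibly, the feature name loop_size and its values are hypothetical, and the frequencies are invented toy numbers, not MiRscan's trained parameters.

```python
import math


def feature_score(value, training_freq, background_freq):
    """Score one feature value: log2 of (training frequency / overall frequency)."""
    return math.log2(training_freq[value] / background_freq[value])


def hairpin_score(features, training, background):
    """Total score of a hairpin: the sum of its per-feature scores."""
    return sum(
        feature_score(value, training[name], background[name])
        for name, value in features.items()
    )


# Toy frequencies, invented for illustration only.
training = {"loop_size": {"small": 0.8, "large": 0.2}}
background = {"loop_size": {"small": 0.4, "large": 0.6}}

print(hairpin_score({"loop_size": "small"}, training, background))  # 1.0
```

A value that is twice as common among known miRNAs as among all conserved hairpins contributes +1; a value that is rarer among known miRNAs contributes a negative score.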

36 MiRscan – Relative importance of hairpin features Certain features were found to be more useful than others in distinguishing miRNAs

37 MiRscan – Testing the algorithm In order to test their algorithm, Lim et al. ran MiRscan on the ~36,000 conserved hairpins in the C. elegans and C. briggsae genomes The 50 known miRNA genes conserved between C. elegans and C. briggsae were used as a training set 35 sequences received a MiRscan score greater than the mean score of the known genes These sequences were given special attention in the experimental portion of this research

38 MiRscan – Results

39 MiRscan – Results example Flanking sequence of control and real matches in the UTRs.

40 Problem 3 conclusions Predicting miRNA genes is a hot subject! – Algorithms use machine learning techniques to predict genes – Candidate genes can be biologically verified to be miRNA genes. Although this process may be slow, it gives feedback and allows refinement of techniques and better predictions – Hundreds (thousands?) of new miRNA genes are expected to be found in the (near?) future! – Commercial companies are performing these kinds of processes for money…

41 Problem 4 What are the targets these bind to?

42 Problem 4: Predicting target genes Mammals/vertebrates – Lots of known miRNAs – Mostly unknown target genes Initial method outline – Look at conserved miRNAs – Look for conserved target sites

43 miRNAs in animals 0.5%-1.0% of predicted genes encode miRNA (!!) – One of the more abundant regulatory classes Tissue-specific or developmental stage- specific expression High evolutionary conservation

44 TargetScan Algorithm by Lewis et al 2003 The Goal – a ranked list of candidate target genes Stage 1: Search UTRs in one organism – Bases 2-8 from miRNA = “miRNA seed” – Perfect Watson-Crick complementarity – No wobble pairs (G-U) – 7nt matches = “seed matches”
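
Stage 1 can be sketched as a plain string search. This is an illustration, not the TargetScan code: both sequences are assumed to be written 5' to 3' in the RNA alphabet, positions are 0-based, the function names are hypothetical, and the UTR in the example is an invented toy string. The miRNA used is let-7a.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def seed_site(mirna):
    """Return the UTR sequence that pairs perfectly with miRNA bases 2-8.

    The two strands pair antiparallel, so the site is the reverse
    complement of the seed (Watson-Crick only, no G-U wobble).
    """
    seed = mirna[1:8]  # bases 2-8 in 1-based numbering
    return "".join(COMPLEMENT[base] for base in reversed(seed))


def find_seed_matches(mirna, utr):
    """Return the 0-based start of every 7nt seed match in the UTR."""
    site = seed_site(mirna)
    return [
        start
        for start in range(len(utr) - len(site) + 1)
        if utr[start:start + len(site)] == site
    ]


let7a = "UGAGGUAGUAGGUUGUAUAGUU"
print(seed_site(let7a))                           # CUACCUC
print(find_seed_matches(let7a, "AAACUACCUCAGG"))  # [3]
```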

45 TargetScan Algorithm Stage 2: Extend seed matches – Allow G-U (wobble) pairs – Both directions – Stop at mismatches
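
The extension step might look like the sketch below. It is a guess at the mechanics under stated assumptions: sequences are 5' to 3', site_start is the 0-based UTR index of a 7nt seed match to miRNA bases 2-8, and the function returns the 0-based inclusive range of miRNA bases that end up paired. The function name and the toy UTR strings are hypothetical.

```python
# Watson-Crick pairs plus the G-U wobble pairs allowed during extension.
PAIRS = {
    ("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G"),
}


def extend_seed_match(mirna, utr, site_start, seed_len=7):
    """Extend a seed match in both directions until the first mismatch."""
    first, last = 1, seed_len          # miRNA indices of bases 2..8
    utr_right = site_start + seed_len  # UTR base opposite miRNA base 1
    utr_left = site_start - 1          # UTR base opposite miRNA base 9
    # Toward the miRNA 5' end (the strands are antiparallel, so move right in the UTR).
    while first > 0 and utr_right < len(utr) and (mirna[first - 1], utr[utr_right]) in PAIRS:
        first -= 1
        utr_right += 1
    # Toward the miRNA 3' end (move left in the UTR).
    while last + 1 < len(mirna) and utr_left >= 0 and (mirna[last + 1], utr[utr_left]) in PAIRS:
        last += 1
        utr_left -= 1
    return first, last


let7a = "UGAGGUAGUAGGUUGUAUAGUU"
print(extend_seed_match(let7a, "AAACUACCUCAGG", 3))  # (0, 8)
print(extend_seed_match(let7a, "CCGCUACCUCCGG", 3))  # (1, 8): a U-G wobble extends the 3' side
```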

46 TargetScan Algorithm Stage 3: Optimize basepairing – Remaining 3’ region of miRNA – 35 bases of UTR 5’ to each seed match – RNAfold program (Hofacker et al 1994)

47 TargetScan Algorithm Stage 4: Folding free energy (ΔG) assigned to each putative miRNA:target interaction Assign rank to each UTR Repeat this process for each of the other organisms with UTR datasets

48 TargetScan Program Flow

49 TargetScan - Results for mammals Database of 79 miRNAs searched against human, mouse, and rat orthologous 3’ UTRs 451 miRNA:target interactions predicted for 400 unique genes Average 5.7 targets per miRNA Signal:noise ratio of 3.2:1

50 TargetScan - Biological relevance Hypothesis: 5’ conservation of miRNAs is important for mRNA target recognition – Highest signal:noise ratio observed when seed positioned close to 5’ end Hypothesis: highly conserved miRNAs are more involved in regulation – High degree of conservation -> more predicted targets – Membership in large miRNA family -> more predicted targets

51 miRNA Targets - Not the end of the story… Many programs claim to discover miRNA targets in mammals: – miRanda - Enright et al, SKI – DIANA-MicroT - Hatzigeorgiou et al, UPenn – rna22 - Rigoutsos et al, IBM – PicTar - Rajewsky et al, NYU

52 Problem 4 conclusion: algorithms comparison NYAS Competition (Feb 17, 2005) – Task: given 2 miRNAs, find mammalian targets Widely differing results, from 1 target to ~500 targets! Very little overlap So who’s “right”? – Currently correct targets unknown…

53 Thank You

