RNA secondary structure prediction and runtime optimization Greg Goldgof October 5, 2006 CS374 Presentation Stanford University.

2 RNA secondary structure prediction and runtime optimization Greg Goldgof October 5, 2006 CS374 Presentation Stanford University

3 Presentation Overview
- Background on RNA secondary structure prediction
- CONTRAfold: probabilistic RNA folding
- How is RNA folding done from an algorithmic perspective?
- Other RNA folding methods: physics-based methods and SCFGs
- CandidateFold: RNA folding in O(n^2)
- Genome-wide accessible motif detection

4 What are RNA and mRNA?
- Traditional role as messenger molecule (mRNA)
- RNA is a polymer of the nucleotides A, U, C, and G, transcribed from DNA
DNA: GATTACA → RNA: GAUUACA
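As a small illustration of the GATTACA to GAUUACA example, the following Python sketch rewrites a DNA coding-strand string as RNA by replacing T with U. The function name `transcribe` is ours, and the sketch deliberately ignores strand direction and every other biological detail of transcription.

```python
def transcribe(dna: str) -> str:
    """Return the RNA string for a DNA coding-strand string (T becomes U)."""
    return dna.upper().replace("T", "U")


print(transcribe("GATTACA"))  # GAUUACA
```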

5 What is RNA secondary structure/folding?
Figure labels: bulge loop, helix (stem), hairpin loop, internal loop, multi-branch loop

6 Pseudoknots  Pseudoknots will not be treated in this talk.  Not dealt with by either paper.

7 Non-coding RNA (RNA genes)
- RNA enzymes: catalytic RNA
- Ribosomal RNA (rRNA)
- Transfer RNA (tRNA)
- RNAi: RNA-mediated gene regulation
- Micro RNA (miRNA)
- Short-interfering RNA (siRNA)
- Alternative splicing: small nuclear RNA (snRNA)
- Others: snoRNA, eRNA, srpRNA, tmRNA, gRNA
Structure is essential to function for many ncRNAs.

9 CONTRAfold
Problem: Given an RNA sequence, predict the most likely secondary structure
AUCCCCGUAUCGAUC
AAAAUCCAUGGGUAC
CCUAGUGAAAGUGUA
UAUACGUGCUCUGAU
UCUUUACUGAGGAGU
CAGUGAACGAACUGA

10 How does CONTRAfold work?
- CONTRAfold looks at features that indicate a good structure. For example:
  - C-G base pairings
  - A-U base pairings
  - Helices of length 5
  - Hairpin loops of size 9
  - Bulge loops of size 2
  - CG/GC base-pair stacking interactions
- These examples are called thermodynamic parameters because they represent free energy values

11 How does CONTRAfold choose a structure?
- Every feature f_i is associated with a weight w_i.
- The probability of a structure y, given a sequence x, is determined by the following relationship:
  $P(y \mid x) \propto \exp\left( \sum_i w_i \, f_i(x, y) \right)$
  where w_i is the weight of feature i and f_i(x, y) is the number of occurrences of feature i in structure y generated from sequence x.
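The weighted-feature relationship on this slide can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the feature names, weights, and candidate structures below are invented, and the candidates are enumerated explicitly, whereas a real folder sums over all structures with dynamic programming.

```python
import math


def structure_probabilities(weights, candidates):
    """Normalize exp(sum_i w_i * f_i) over an explicit set of candidate structures.

    weights:    feature name -> weight w_i
    candidates: structure label -> {feature name: count f_i(x, y)}
    """
    scores = {
        label: sum(weights[f] * count for f, count in counts.items())
        for label, counts in candidates.items()
    }
    top = max(scores.values())  # subtract the max before exponentiating, for stability
    unnormalized = {label: math.exp(s - top) for label, s in scores.items()}
    z = sum(unnormalized.values())
    return {label: u / z for label, u in unnormalized.items()}


# Hypothetical weights and feature counts, chosen only for illustration.
weights = {"CG_pair": 1.2, "AU_pair": 0.7, "hairpin_len_9": -0.5}
candidates = {
    "structure_A": {"CG_pair": 3, "AU_pair": 1, "hairpin_len_9": 1},
    "structure_B": {"CG_pair": 1, "AU_pair": 2},
    "unfolded": {},
}
probs = structure_probabilities(weights, candidates)
for label, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {p:.3f}")
```

The structure with the largest weighted feature sum receives the largest probability, and the probabilities sum to 1 over the candidates that were supplied.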

12 How does CONTRAfold choose a structure? Cont’d
- Considers all structures and finds the optimal structure via dynamic programming in O(n^3)
- Added bonus: a probability associated with each base
Figure legend: low-confidence bases lighter, high-confidence bases darker

13 Parameter γ allows trade-off between sensitivity and specificity
Sensitivity = (# correct base pairings) / (# true base pairings)
Specificity = (# correct base pairings) / (# predicted base pairings)
Figure panels: γ = 1, γ = 8, γ = 1024
AUCCCCGUAUCGAUC
AAAAUCCAUGGGUAC
CCUAGUGAAAGUGUA
UAUACGUGCUCUGAU
UCUUUACUGAGGAGU
CAGUGAACGAACUGA
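The two ratios on this slide are easy to compute once a structure is represented as a set of (i, j) base pairs. The sketch below assumes that representation; the function name and the example pair sets are made up for illustration.

```python
def sensitivity_specificity(predicted, true):
    """Return (sensitivity, specificity) as defined on the slide.

    predicted, true: sets of (i, j) index pairs, one per base pairing.
    """
    correct = len(predicted & true)
    sensitivity = correct / len(true) if true else 0.0
    specificity = correct / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Hypothetical example: 3 true pairings, 4 predicted pairings, 2 in common.
true_pairs = {(0, 20), (1, 19), (2, 18)}
predicted_pairs = {(0, 20), (1, 19), (3, 17), (5, 12)}
sens, spec = sensitivity_specificity(predicted_pairs, true_pairs)
print(sens, spec)
```

Predicting more pairings can only raise the numerator of sensitivity, but it also grows the denominator of specificity, which is the trade-off that γ controls.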

14 CONTRAfold learns how to predict good structures
- CONTRAfold trains on a set of published examples of known RNA structures taken from a database called Rfam (RNA families)
- A training set is a collection of known correct solutions that a program learns from
- CONTRAfold learns the relative value, or weight, of each of its features
- CONTRAfold determines the weight for each feature that maximizes its performance on the training set

15 CONTRAfold Performance

17 Other Methods
- Stochastic context-free grammars
- Physics-based models

18 Physics-based models
- All features reflect thermodynamic interactions
- Features experimentally determined in the lab, rather than learned
- Until CONTRAfold, the best-performing method
Disadvantages compared to CONTRAfold:
- Thermodynamic weights are difficult to calculate
- No incorporation of non-thermodynamic features
- Cannot be tailored to specific families of RNAs, since the weights are always the same
- Cannot trade off between sensitivity and specificity
- No probabilities associated with each base pairing

19 Stochastic context-free grammars
- Based on grammar rules with associated probabilities:
  S → aSu | cSg | aS | uS | … | Su | SS | ε
  P: .21 .15 .11 .08 .03 .22 .02
- Let’s generate a structure for the sequence acuguaucuag:

  Derivation      Bases emitted   Structure
  S
  aS              a               .
  acSg            acg             .()
  acuSag          acuag           .(())
  acuSuag         acuuag          .((.))
  acugScuag       acugcuag        .((().))
  acuguScuag      acugucuag       .(((.).))
  acuguaScuag     acuguacuag      .(((..).))
  acuguauScuag    acuguaucuag     .(((...).))
  acuguaucuag     acuguaucuag     .(((...).))

- We select the set of transformations that has the highest probability of generating the input sequence. This set gives us our structure.
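The derivation on this slide can be replayed mechanically. The Python sketch below is an illustration, not part of the talk: it represents each rule application as a hypothetical (left, right) tuple of the terminals placed around S, treats a rule with both terminals as a base pair, and does not handle the branching rule S → SS or the rule probabilities.

```python
def replay(rules):
    """Replay rules of the form S -> left S right and return (sequence, structure).

    rules: list of (left, right) terminal strings; "" means no terminal on that side.
    A rule with both terminals emits a base pair; a one-sided rule emits an unpaired base.
    """
    left_seq, left_str = [], []
    right_seq, right_str = [], []  # stored outermost-first, reversed at the end
    for left, right in rules:
        paired = bool(left and right)
        if left:
            left_seq.append(left)
            left_str.append("(" if paired else ".")
        if right:
            right_seq.append(right)
            right_str.append(")" if paired else ".")
    sequence = "".join(left_seq) + "".join(reversed(right_seq))
    structure = "".join(left_str) + "".join(reversed(right_str))
    return sequence, structure


# S -> aS -> acSg -> acuSag -> acuSuag -> acugScuag -> acuguScuag
#   -> acuguaScuag -> acuguauScuag -> acuguaucuag (final step is S -> ε)
derivation = [
    ("a", ""), ("c", "g"), ("u", "a"), ("", "u"),
    ("g", "c"), ("u", ""), ("a", ""), ("u", ""),
]
print(replay(derivation))  # ('acuguaucuag', '.(((...).))')
```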

20 Stochastic context-free grammars cont’d
- Like CONTRAfold, transformation probabilities can be automatically trained
- Therefore, they can also be optimized to specific datasets
- Provide an associated probability with a given structure
Disadvantages compared to CONTRAfold:
- Grammar rules of SCFGs are less expressive than the features of CONTRAfold or physics-based methods
- Poor accuracy: always dominated by physics-based models

21 Advantages of CONTRAfold
- High accuracy
- Automated training of parameters
- Can be tuned to specific data
- Provides associated probabilities for each base pairing
- Ability to control the sensitivity/specificity trade-off
- Can incorporate both physics-based and non-thermodynamic parameters

23 How is RNA folding done? Simple Nussinov Folding Algorithm
- Only scores interactions between paired bases
- Useful for demonstrating the general structure of more complex folding algorithms
Annotations on the recurrence (equation image not preserved):
- Left-hand side: score for the optimal structure from base i to base j
- We want the highest-scoring fold
- Base i is unpaired; consider pairing between i+1 and j
- Base j is unpaired; consider pairing between i and j-1
- δ(i, j) = score for a pairing between i and j

24 How is RNA folding done? Simple Nussinov Folding Algorithm
- Pair i and j. Now consider pairing between i+1 and j-1.

25 How is RNA folding done? Simple Nussinov Folding Algorithm
- i and j begin a bifurcation. Consider every possible bifurcation point k. Sum the scores from each folded structure.
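The four cases described on these slides (i unpaired, j unpaired, i paired with j, bifurcation at k) can be put together as a short dynamic program. The Python sketch below is one common textbook formulation, written under our own assumptions: δ(i, j) is 1 for a Watson-Crick pair and 0 otherwise, no minimum hairpin loop length is enforced, and the table name `gamma` is ours since the slides do not preserve the symbol.

```python
def delta(a: str, b: str) -> int:
    """Score for pairing two bases: 1 for a Watson-Crick pair, else 0 (an assumption)."""
    return 1 if (a, b) in {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")} else 0


def nussinov(seq: str):
    """Return (best score, one optimal dot-bracket structure) for seq."""
    n = len(seq)
    if n == 0:
        return 0, ""
    # gamma[i][j] = best score for bases i..j; cells with i > j stay 0.
    gamma = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = max(
                gamma[i + 1][j],                              # base i unpaired
                gamma[i][j - 1],                              # base j unpaired
                gamma[i + 1][j - 1] + delta(seq[i], seq[j]),  # pair i with j
            )
            for k in range(i + 1, j):                         # bifurcation at k
                best = max(best, gamma[i][k] + gamma[k + 1][j])
            gamma[i][j] = best

    # Trace back one optimal structure with an explicit stack.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        if gamma[i + 1][j] == gamma[i][j]:
            stack.append((i + 1, j))
        elif gamma[i][j - 1] == gamma[i][j]:
            stack.append((i, j - 1))
        elif delta(seq[i], seq[j]) and gamma[i + 1][j - 1] + 1 == gamma[i][j]:
            structure[i], structure[j] = "(", ")"
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):
                if gamma[i][k] + gamma[k + 1][j] == gamma[i][j]:
                    stack.append((i, k))
                    stack.append((k + 1, j))
                    break
    return gamma[0][n - 1], "".join(structure)


score, fold = nussinov("GGGAAAUCC")
print(score, fold)
```

For GGGAAAUCC the best score is 3 under these assumptions; several structures achieve it, and the traceback returns whichever one its case order reaches first.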

26 How is RNA folding done?
- What is the runtime of the Nussinov algorithm?
For a given sequence of length n we must consider:
- All possible values of i: O(n)
For each i we must consider:
- All possible values of j: O(n)
For each i, j pair we must consider:
- All possible values of k: O(n)
O(n) * O(n) * O(n) → O(n^3)
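To make the cubic bound concrete, the following Python sketch simply counts how many (i, j, k) combinations with i < k < j the bifurcation step visits. The function name is ours. The count equals the binomial coefficient C(n, 3), which grows as n^3 / 6.

```python
from math import comb


def count_bifurcation_steps(n: int) -> int:
    """Count the (i, j, k) triples with i < k < j visited by the triple loop."""
    steps = 0
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(i + 1, j):
                steps += 1
    return steps


for n in (10, 20, 40):
    print(n, count_bifurcation_steps(n), comb(n, 3))
# 10 120 120
# 20 1140 1140
# 40 9880 9880
```

Doubling n multiplies the work by roughly a factor of eight, as expected for an O(n^3) algorithm.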

27 A more sophisticated algorithm  We want to take into account more advanced features than just base-pairings.

28 What is V(i, j)? eh = energy of a hairpin closed at i and j

29 What is V(i, j)? es = energy of the stacked pair i, j and i+1, j-1

30 What is V(i, j)? ebi = energy of a bulge or interior loop that begins at i, j and is closed at i’, j’

31 What is V(i, j)? Same old bifurcation equation, but i is paired to j
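The equation images for these slides are not preserved. As a sketch only, the four cases (hairpin, stacked pair, bulge or interior loop, bifurcation with i paired to j) are commonly combined in the following Zuker-style form, where W(i, j) denotes the minimum energy of any structure on bases i..j. The exact index bounds and any multi-loop penalties used in the talk are assumptions here.

```latex
\begin{align*}
V(i,j) &= \min \begin{cases}
  eh(i,j) \\
  es(i,j) + V(i+1,\,j-1) \\
  \min_{i < i' < j' < j} \bigl[\, ebi(i,j,i',j') + V(i',j') \,\bigr] \\
  \min_{i < k < j-1} \bigl[\, W(i+1,\,k) + W(k+1,\,j-1) \,\bigr]
\end{cases} \\[4pt]
W(i,j) &= \min \begin{cases}
  W(i+1,\,j) \\
  W(i,\,j-1) \\
  V(i,j) \\
  \min_{i < k < j} \bigl[\, W(i,k) + W(k+1,\,j) \,\bigr]
\end{cases}
\end{align*}
```

The W recurrence has the same shape as the Nussinov recurrence, with the "pair i with j" case delegated to V(i, j).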

32 What is its runtime?
- Still only O(n^3) because we are only recursing on i, j, and k
- This equation (the bulge/interior loop term) is theoretically O(n); however, it is standard to bound the size of RNA interior loops by a constant (30), making it O(1)

34 CandidateFold
- What does it do?
  - Same folding as the complex model in O(n^2 ψ(n)), where ψ(n) is shown to be a constant
- How does it do it?
  - Imposes some constraints on W and V (equation panels "From W" and "From V" not preserved)
  - Rather than trying all k, they keep a list of candidate positions, reducing this step to O(1) time
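The candidate-list idea can be illustrated on the simple base-pair-maximization score instead of the full W/V energy model. The Python sketch below is our own simplified illustration, not the algorithm from the paper: for each end position it keeps only those partner positions whose closed structure strictly beats every alternative, and it checks the result against a plain cubic baseline. All function names are ours.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def max_pairs_cubic(seq):
    """Baseline: N[i][j] = max pairs in seq[i:j], trying every partner k of base j-1."""
    n = len(seq)
    N = [[0] * (n + 1) for _ in range(n + 1)]
    for j in range(1, n + 1):
        for i in range(j - 1, -1, -1):
            best = N[i][j - 1]                       # base j-1 unpaired
            for k in range(i, j - 1):                # base j-1 paired with k
                if (seq[k], seq[j - 1]) in PAIRS:
                    best = max(best, N[i][k] + 1 + N[k + 1][j - 1])
            N[i][j] = best
    return N[0][n]


def max_pairs_candidates(seq):
    """Same table, but only partners stored in a candidate list are tried."""
    n = len(seq)
    N = [[0] * (n + 1) for _ in range(n + 1)]
    total_candidates = 0
    for j in range(1, n + 1):
        candidates = []                              # (k, score of closing k with j-1)
        for i in range(j - 1, -1, -1):
            best = N[i][j - 1]
            for k, closed in candidates:             # all k here satisfy k > i
                best = max(best, N[i][k] + closed)
            if i < j - 1 and (seq[i], seq[j - 1]) in PAIRS:
                closed = 1 + N[i + 1][j - 1]
                if closed > best:                    # not beaten by any split
                    candidates.append((i, closed))
                    best = closed
            N[i][j] = best
        total_candidates += len(candidates)
    return N[0][n], total_candidates


seq = "GGGAAAUCCGCAUUAGC"
score, kept = max_pairs_candidates(seq)
print(score, max_pairs_cubic(seq), kept)
```

A partner k is dropped when its closed structure is no better than some split of the same interval, because that split then does at least as well in every larger interval too. The speed-up depends on how short the candidate lists stay in practice, which is what ψ(n) measures on the slide.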

35 CandidateFold
- What is the advantage of CandidateFold? Much faster RNA folding
- What is an application of high-speed RNA folding? Accessible motif finding

37 What is an RNA regulatory motif?
- Motif: a conserved sequence element
- RNA regulatory motif: a motif used to regulate translation
- A regulator binds to a regulatory motif:
  - Regulatory protein
  - Micro RNA
Figure: RNA G A U U A C A... with regulatory motif (AUUAC), bound by microRNA U A A U G

38 What is an accessible motif?  If a sequence is part of an intramolecular hybridization, it is unlikely to bind to regulators  We define a motif as “accessible” if none of its nucleotides is hybridized as part of the folding
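Given a folding written in dot-bracket notation, the definition above reduces to a simple check. The Python sketch below assumes that notation ("." for an unpaired base, "(" and ")" for paired bases); the function name and the example window positions are ours.

```python
def is_accessible(structure: str, start: int, length: int) -> bool:
    """True if every base in the window is unpaired ('.') in the dot-bracket string."""
    window = structure[start:start + length]
    return len(window) == length and all(c == "." for c in window)


fold = ".(((...).))"
print(is_accessible(fold, 4, 3))  # True: positions 4-6 are all unpaired
print(is_accessible(fold, 3, 3))  # False: position 3 is paired
```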

39 Accessible motifs cont’d  Therefore, only accessible sequences should be scanned for regulatory motifs

41 How do Wexler et al. detect regulatory motifs?
Problem: Given a set of mRNAs G, a parameter k denoting motif window size, and a pre-defined energy threshold δ, find the regulatory motifs
- Stage 1: Process sequence set G to extract all “accessible windows”
  - Run a sliding window of size k across each mRNA sequence
  - Find the minimal energy fold for the sequence, assuming none of the bases in the window are paired
  - If the energy of this folding minus the energy of a normal folding of the mRNA < δ, then accept the window
- Stage 2: Search for regulatory motifs among the “accessible windows”
  - Motif finding will be discussed in later lectures
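Stage 1 can be sketched as a loop over window positions that compares a constrained fold with the unconstrained one. The Python code below is a toy illustration under our own assumptions: `toy_energy` is a stand-in that scores -1 per Watson-Crick pair rather than a real thermodynamic model, and both function names are hypothetical. Any energy function that accepts a set of positions forced to stay unpaired could be substituted.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def toy_energy(seq, blocked=frozenset()):
    """Toy minimum energy: -1 per base pair; positions in `blocked` must stay unpaired."""
    n = len(seq)
    N = [[0] * (n + 1) for _ in range(n + 1)]   # N[i][j] = max pairs in seq[i:j]
    for j in range(1, n + 1):
        for i in range(j - 1, -1, -1):
            best = N[i][j - 1]
            if (j - 1) not in blocked:
                for k in range(i, j - 1):
                    if k not in blocked and (seq[k], seq[j - 1]) in PAIRS:
                        best = max(best, N[i][k] + 1 + N[k + 1][j - 1])
            N[i][j] = best
    return -N[0][n]


def accessible_windows(seq, k, delta, energy=toy_energy):
    """Return start positions of size-k windows whose unpairing costs less than delta."""
    free = energy(seq)                           # energy of the normal folding
    hits = []
    for start in range(len(seq) - k + 1):
        blocked = frozenset(range(start, start + k))
        if energy(seq, blocked) - free < delta:
            hits.append(start)
    return hits


print(accessible_windows("GGGAAAUCC", k=3, delta=1))  # [2]
```

Each window requires one extra fold of the whole sequence, which is why a faster folding algorithm matters for scanning a full set of mRNAs.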

42 Results: Degradation Related Motifs

43 Results: Tissue Specific microRNAs Silique: A long, slender, many-seeded, cylindrical fruit of the Mustard Family

44 The End

45 Works Cited
CB Do, DA Woods, S Batzoglou. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 22(14): e90-e98, 2006.
Y Wexler, C Zilberstein, M Ziv-Ukelson. A Study of Accessible Motifs and RNA Folding Complexity. RECOMB 2006, LNBI 3909: 473-487, 2006.

