RNA secondary structure prediction and runtime optimization
Greg Goldgof, October 5, 2006
CS374 Presentation, Stanford University

Presentation Overview
• Background on RNA secondary structure prediction
• CONTRAfold: probabilistic RNA folding
• How is RNA folding done from an algorithmic perspective?
• Other RNA folding methods: physics-based methods and SCFGs
• CandidateFold: RNA folding in O(n^2)
• Genome-wide accessible motif detection

What are RNA and mRNA?
• Traditional role as messenger molecule (mRNA)
• RNA is a polymer of the nucleotides A, U, C, and G, transcribed from DNA
Example: DNA GATTACA → RNA GAUUACA

What is RNA secondary structure/folding?
[Figure: a folded RNA with labeled elements: helix (stem), hairpin loop, bulge loop, internal loop, multi-branch loop]

Pseudoknots
• Pseudoknots will not be treated in this talk.
• Neither paper deals with them.

Non-coding RNA (RNA genes)
• RNA enzymes: catalytic RNA
• Ribosomal RNA (rRNA)
• Transfer RNA (tRNA)
• RNAi: RNA-mediated gene regulation
  • Micro RNA (miRNA)
  • Short-interfering RNA (siRNA)
• Alternative splicing: small-nuclear RNA (snRNA)
• Others: snoRNA, eRNA, srpRNA, tmRNA, gRNA
Structure is essential to function for many ncRNAs.

CONTRAfold
Problem: Given an RNA sequence, predict the most likely secondary structure.
Example input:
AUCCCCGUAUCGAUC
AAAAUCCAUGGGUAC
CCUAGUGAAAGUGUA
UAUACGUGCUCUGAU
UCUUUACUGAGGAGU
CAGUGAACGAACUGA

How does CONTRAfold work?
• CONTRAfold looks at features that indicate a good structure. For example:
  • C-G base pairings
  • A-U base pairings
  • Helices of length 5
  • Hairpin loops of size 9
  • Bulge loops of size 2
  • CG/GC base-pair stacking interactions
• These examples are called thermodynamic parameters because they represent free energy values

How does CONTRAfold choose a structure?
• Every feature f_i is associated with a weight w_i.
• The probability of a structure y, given a sequence x, is determined by the following relationship:

$$P(y \mid x) \;\propto\; \exp\Big(\sum_i w_i \, f_i(x, y)\Big)$$

where w_i is the weight of feature i and f_i(x, y) is the number of occurrences of feature i in structure y generated from sequence x.
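As a rough illustration of this relationship, the sketch below scores a few candidate structures by their weighted feature counts and normalizes the exponentiated scores. The feature names, weights, and candidate list are hypothetical and chosen only for the example. CONTRAfold itself normalizes over all possible structures using dynamic programming rather than an explicit list.

```python
import math

def structure_probabilities(feature_counts, weights):
    """Return P(y | x) for each candidate y, normalized over the given candidates only.

    feature_counts: {structure_name: {feature_name: count}}
    weights:        {feature_name: weight}
    """
    # Linear score for each structure: sum_i w_i * f_i(x, y)
    scores = {
        name: sum(weights[f] * c for f, c in counts.items())
        for name, counts in feature_counts.items()
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    unnormalized = {name: math.exp(s - top) for name, s in scores.items()}
    total = sum(unnormalized.values())
    return {name: u / total for name, u in unnormalized.items()}

# Hypothetical weights and feature counts (illustrative values, not trained).
WEIGHTS = {"CG_pair": 1.0, "AU_pair": 0.5, "hairpin_size_9": -0.5}
CANDIDATES = {
    "fold_A": {"CG_pair": 2, "hairpin_size_9": 1},   # score 1.5
    "fold_B": {"AU_pair": 1, "hairpin_size_9": 1},   # score 0.0
}
PROBS = structure_probabilities(CANDIDATES, WEIGHTS)
print(PROBS)
```

Because only score differences matter after normalization, the ratio of the two probabilities here equals exp(1.5).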

How does CONTRAfold choose a structure? (cont'd)
• Considers all structures and finds the optimal structure via dynamic programming in O(n^3)
• Added bonus: a probability associated with each base
[Figure: predicted structure shaded by confidence; low-confidence bases are lighter, high-confidence bases are darker]

Parameter γ allows trade-off between sensitivity and specificity

$$\text{Sensitivity} = \frac{\#\ \text{correct base pairings}}{\#\ \text{true base pairings}} \qquad \text{Specificity} = \frac{\#\ \text{correct base pairings}}{\#\ \text{predicted base pairings}}$$

[Figure: predicted structures for the example sequence at γ = 1, γ = 8, and γ = 1024]
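The two measures can be computed directly from sets of base pairs. The following is a minimal sketch that assumes structures are given in balanced dot-bracket notation; the helper names are made up for this example, and returning 0.0 when a denominator is zero is a convention chosen here.

```python
def base_pairs(dot_bracket):
    """Parse balanced dot-bracket notation into a set of (i, j) index pairs."""
    stack, pairs = [], set()
    for pos, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            pairs.add((stack.pop(), pos))
    return pairs

def sensitivity_specificity(predicted, true):
    """Return (sensitivity, specificity) as defined on the slide."""
    pred_pairs, true_pairs = base_pairs(predicted), base_pairs(true)
    correct = len(pred_pairs & true_pairs)
    sensitivity = correct / len(true_pairs) if true_pairs else 0.0
    specificity = correct / len(pred_pairs) if pred_pairs else 0.0
    return sensitivity, specificity

# The prediction recovers one of the two true pairs and predicts nothing wrong.
print(sensitivity_specificity("(....)", "((..))"))
```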

CONTRAfold learns how to predict good structures
• CONTRAfold trains on a set of published examples of known RNA structures taken from a database called Rfam (RNA families)
• CONTRAfold learns the relative value, or weight, of each of its features
• CONTRAfold determines the weight for each feature that maximizes its performance on the training set.
• A training set is a collection of known correct solutions that a program learns from.

CONTRAfold Performance

Other Methods
• Stochastic context-free grammars
• Physics-based models

Physics-based models
• All features reflect thermodynamic interactions
• Features experimentally determined in the lab, rather than learned
• Until CONTRAfold, the best-performing method
Disadvantages compared to CONTRAfold
• Thermodynamic weights are difficult to calculate
• No incorporation of non-thermodynamic features
• Cannot be tailored to specific families of RNAs, since the weights are always the same
• Cannot trade off between sensitivity and specificity
• No probabilities associated with each base pairing

Stochastic context-free grammars
• Based on grammar rules with associated probabilities P:
  S → aSu | cSg | aS | uS | … | Su | SS | ε
• Let's generate a structure for the sequence acuguaucuag:
  S ⇒ aS ⇒ acSg ⇒ acuSag ⇒ acuSuag ⇒ acugScuag ⇒ acuguScuag ⇒ acuguaScuag ⇒ acuguauScuag ⇒ acuguaucuag
  Resulting structure: .(((...).))
• We select the set of transformations that has the highest probability of generating the input sequence. This set gives us our structure.
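The derivation on this slide, which ends in acuguaucuag with structure .(((...).)), can be replayed mechanically. The sketch below is illustrative only: each rule is written as a (left, right) pair of emitted bases around S, the rule probabilities are hypothetical values rather than trained ones, and the bifurcation rule S → SS is left out to keep the example short.

```python
from math import prod

# Hypothetical rule probabilities (illustrative only). Keys are (left, right)
# emissions around S; ("", "") stands for S -> epsilon.
RULE_PROB = {
    ("a", ""): 0.10,   # S -> aS
    ("u", ""): 0.10,   # S -> uS
    ("", "u"): 0.10,   # S -> Su
    ("c", "g"): 0.20,  # S -> cSg
    ("g", "c"): 0.20,  # S -> gSc
    ("u", "a"): 0.15,  # S -> uSa
    ("a", "u"): 0.10,  # S -> aSu
    ("", ""): 0.05,    # S -> epsilon
}

def replay(derivation):
    """Apply rules in order; return (sequence, dot-bracket structure, probability)."""
    left_seq, left_str, right_seq, right_str = [], [], [], []
    for left, right in derivation:
        paired = bool(left) and bool(right)
        if left:
            left_seq.append(left)
            left_str.append("(" if paired else ".")
        if right:
            right_seq.append(right)
            right_str.append(")" if paired else ".")
    # Right-hand emissions accumulate outside-in, so reverse them when joining.
    sequence = "".join(left_seq) + "".join(reversed(right_seq))
    structure = "".join(left_str) + "".join(reversed(right_str))
    probability = prod(RULE_PROB[rule] for rule in derivation)
    return sequence, structure, probability

DERIVATION = [("a", ""), ("c", "g"), ("u", "a"), ("", "u"), ("g", "c"),
              ("u", ""), ("a", ""), ("u", ""), ("", "")]
SEQ, STRUCT, PROB = replay(DERIVATION)
print(SEQ, STRUCT, PROB)
```

Choosing a structure then amounts to finding, among all derivations that emit the input sequence, the one with the largest probability product.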

Stochastic context-free grammars (cont'd)
• Like CONTRAfold, transformation probabilities can be automatically trained
• Therefore, they can also be optimized to specific datasets
• Provide an associated probability with a given structure
Disadvantages compared to CONTRAfold
• Grammar rules of SCFGs are less expressive than the features of CONTRAfold or physics-based methods
• Poor accuracy: always dominated by physics-based models

Advantages of CONTRAfold
• High accuracy
• Automated training of parameters
• Can be tuned to specific data
• Provides associated probabilities for each base-pairing
• Ability to control sensitivity/specificity trade-off
• Can incorporate both physics-based and non-thermodynamic parameters

How is RNA folding done? Simple Nussinov Folding Algorithm
• Only scores interactions between paired bases
• Useful for demonstrating general structure of more complex folding algorithms
The recurrence computes the score for the optimal structure from base i to base j. We want the highest-scoring fold among four cases:
• Base i is unpaired: consider pairing between i+1 and j.
• Base j is unpaired: consider pairing between i and j-1.
• Pair i and j, adding δ(i, j), the score for a pairing between i and j. Now consider pairing between i+1 and j-1.
• i and j begin a bifurcation: consider every possible bifurcation point k and sum the scores from each folded structure.
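The slide's equation did not survive in this transcript. The four annotated cases correspond to the standard Nussinov recurrence, sketched below. The symbol N(i, j) for the optimal score from base i to base j is an assumption made here (the slide's own symbol is not recoverable); δ(i, j) is the pairing score named on the slide.

$$
N(i, j) = \max
\begin{cases}
N(i+1,\, j) & \text{base } i \text{ unpaired} \\
N(i,\, j-1) & \text{base } j \text{ unpaired} \\
N(i+1,\, j-1) + \delta(i, j) & i \text{ and } j \text{ paired} \\
\max_{i < k < j} \big[ N(i, k) + N(k+1,\, j) \big] & \text{bifurcation at } k
\end{cases}
$$

with base cases N(i, i) = 0 and N(i, i-1) = 0.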

How is RNA folding done?
• What is the runtime of the Nussinov algorithm?
For a given sequence of length n we must consider:
• All possible values of i: O(n)
For each i we must consider:
• All possible values of j: O(n)
For each i, j pair we must consider:
• All possible values of k: O(n)
Total: O(n) * O(n) * O(n) → O(n^3)
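The three nested loops behind the O(n^3) bound are easy to see in code. This is a minimal sketch of Nussinov-style scoring under stated assumptions: δ is taken to be 1 for A-U, G-C, and G-U pairs and 0 otherwise, and `min_loop` is an optional extra parameter that is not part of the slides (0 reproduces the simple version).

```python
ALLOWED = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def delta(a, b):
    """Assumed pairing score: 1 for an allowed pair, 0 otherwise."""
    return 1 if (a, b) in ALLOWED else 0

def nussinov_score(seq, min_loop=0):
    """Return the best total pairing score for seq."""
    n = len(seq)
    if n == 0:
        return 0
    # N[i][j] = best score for bases i..j (inclusive); entries with i > j stay 0.
    N = [[0] * n for _ in range(n)]
    for span in range(1, n):                  # O(n) subsequence lengths
        for i in range(n - span):             # O(n) start positions
            j = i + span
            best = max(N[i + 1][j], N[i][j - 1])      # i unpaired / j unpaired
            if j - i > min_loop and delta(seq[i], seq[j]) > 0:
                best = max(best, N[i + 1][j - 1] + delta(seq[i], seq[j]))
            for k in range(i + 1, j):         # O(n) bifurcation points
                best = max(best, N[i][k] + N[k + 1][j])
            N[i][j] = best
    return N[0][n - 1]

print(nussinov_score("GGGAAACCC"))
```

Only the score is returned here; recovering the structure itself would need a traceback through the table.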

A more sophisticated algorithm
• We want to take into account more advanced features than just base-pairings.

What is V(i, j)?
• eh = energy of a hairpin closed at i and j
• es = energy of the stacked pair i, j and i+1, j-1
• ebi = energy of a bulge or interior loop that begins at i, j and is closed at i', j'
• Same old bifurcation equation, but i is paired to j
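The equations themselves are missing from this transcript. The terms eh, es, and ebi fit the commonly used Zuker-style energy-minimization recurrences, sketched below as an assumption rather than a copy of the slide. Here V(i, j) is the minimum energy of a structure on bases i..j in which i is paired with j, and W(i, j) is the minimum energy with no such requirement; the exact index bounds and any multi-loop penalty terms may differ from the original.

$$
W(i, j) = \min
\begin{cases}
W(i+1,\, j) \\
W(i,\, j-1) \\
V(i, j) \\
\min_{i < k < j} \big[ W(i, k) + W(k+1,\, j) \big]
\end{cases}
$$

$$
V(i, j) = \min
\begin{cases}
eh(i, j) \\
es(i, j) + V(i+1,\, j-1) \\
\min_{i < i' < j' < j} \big[ ebi(i, j, i', j') + V(i', j') \big] \\
\min_{i < k < j-1} \big[ W(i+1,\, k) + W(k+1,\, j-1) \big]
\end{cases}
$$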

What is its runtime?
• Still only O(n^3) because we are only recursing on i, j, and k
• The bulge/interior loop term is theoretically O(n^2) for each i, j, because it ranges over all inner pairs i', j'; however, it is standard to bound RNA interior loops by a constant (30), making it O(1)
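To see why the bound makes the loop term constant-time, one can enumerate the inner pairs that remain once the loop size is capped. The sketch below assumes that "size" means the total number of unpaired bases on both sides of the loop; the function name and that interpretation are assumptions for illustration.

```python
def interior_loop_closings(i, j, max_loop=30):
    """Yield inner pairs (ip, jp) with i < ip < jp < j whose bulge or interior
    loop has between 1 and max_loop unpaired bases in total."""
    for left in range(max_loop + 1):                # unpaired bases after i
        ip = i + 1 + left
        for right in range(max_loop - left + 1):    # unpaired bases before j
            jp = j - 1 - right
            if left + right == 0:
                continue        # no unpaired bases: a stacked pair, scored by es
            if ip < jp:
                yield (ip, jp)

# The count stops growing once the span is large enough.
print(len(list(interior_loop_closings(0, 1000))))
print(len(list(interior_loop_closings(0, 2000))))
```

For a cap of 30 the number of candidates never exceeds 31 * 32 / 2 - 1 = 495, no matter how far apart i and j are.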

CandidateFold
• What does it do?
  • Same folding as the complex model in O(n^2 ψ(n)), where ψ(n) is shown to be a constant
• How does it do it?
  • Imposes some constraints on W and V
  • [Equations: the bifurcation terms from W and from V]
  • Rather than trying all k, they keep a list of candidate positions, reducing this step to O(1) time
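The candidate-list idea can be illustrated on the simpler base-pair-counting model. The sketch below is not the authors' implementation and does not use their energy model: it keeps, for each end position, only those start positions whose closed structure strictly beats every other decomposition, and it scans that list instead of all split points. A plain cubic version is included so the two can be compared. All names are made up for this example, and the worst case is still cubic; the saving depends on the candidate lists staying short.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def fold_cubic(seq):
    """Reference: try every partner k for the last base. N[i][j] covers seq[i:j]."""
    n = len(seq)
    N = [[0] * (n + 1) for _ in range(n + 1)]
    for j in range(1, n + 1):
        for i in range(j - 1, -1, -1):
            best = N[i][j - 1]                       # last base unpaired
            for k in range(i, j - 1):                # seq[k] pairs with seq[j-1]
                if (seq[k], seq[j - 1]) in PAIRS:
                    best = max(best, N[i][k] + 1 + N[k + 1][j - 1])
            N[i][j] = best
    return N[0][n]

def fold_candidates(seq):
    """Same scores, but only candidate partners of the last base are scanned."""
    n = len(seq)
    N = [[0] * (n + 1) for _ in range(n + 1)]
    for j in range(1, n + 1):
        candidates = []                              # (k, score of closed seq[k:j])
        for i in range(j - 1, -1, -1):
            best = N[i][j - 1]
            for k, closed in candidates:             # every stored k is > i
                best = max(best, N[i][k] + closed)
            if i < j - 1 and (seq[i], seq[j - 1]) in PAIRS:
                closed = 1 + N[i + 1][j - 1]
                if closed > best:                    # no decomposition does as well
                    candidates.append((i, closed))
                    best = closed
            N[i][j] = best
    return N[0][n]

for s in ["GGGAAACCC", "ACGUACGUACGU", "GCGCGCAUAUGGCC"]:
    print(s, fold_cubic(s), fold_candidates(s))
```

A non-candidate start position can be skipped safely because its closed structure is matched by a decomposition that some other entry already covers.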

CandidateFold
• What is the advantage of CandidateFold?
  • Much faster RNA folding
• What is an application of high-speed RNA folding?
  • Accessible motif finding

What is an RNA regulatory motif?
• Motif: a conserved sequence element
• RNA regulatory motif: a motif used to regulate translation
• A regulator binds to a regulatory motif
  • Regulatory protein
  • Micro RNA
[Figure: the RNA G A U U A C A ... contains the regulatory motif AUUAC, which pairs with the microRNA U A A U G]

What is an accessible motif?
• If a sequence is part of an intramolecular hybridization, it is unlikely to bind to regulators
• We define a motif as “accessible” if none of its nucleotides is hybridized as part of the folding
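Given a predicted fold, the definition reduces to a simple check. The sketch below assumes the fold is available in dot-bracket notation, where "." marks an unpaired base; the function name and arguments are hypothetical.

```python
def is_accessible(structure, start, length):
    """True if every base in structure[start:start+length] is unpaired ('.')."""
    window = structure[start:start + length]
    # A window that runs off the end of the structure is not a valid motif site.
    return len(window) == length and all(ch == "." for ch in window)

FOLD = ".(((...).))"
print(is_accessible(FOLD, 4, 3))   # the three loop bases
print(is_accessible(FOLD, 3, 3))   # overlaps a paired base
```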

Accessible motifs (cont'd)
• Therefore, only accessible sequences should be scanned for regulatory motifs

How do Wexler et al. detect regulatory motifs?
Problem: Given a set of mRNAs G, a parameter k denoting motif window size, and a pre-defined energy threshold δ, find the regulatory motifs
• Stage 1: Process sequence set G to extract all “accessible windows”
  • Run a sliding window of size k across each mRNA sequence
  • Find the minimal energy fold for the sequence, assuming none of the bases in the window are paired
  • If the energy of this folding minus the energy of a normal folding of the mRNA < δ, then accept the window
• Stage 2: Search for regulatory motifs among the “accessible windows”
  • Motif finding will be discussed in later lectures
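Stage 1 can be sketched as follows. This is an illustration under strong assumptions, not the authors' method: the energy of a fold is replaced by minus the maximum number of base pairs (a Nussinov-style stand-in for a real thermodynamic model), a minimum hairpin loop of 3 bases is assumed, and all function and parameter names are hypothetical.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def fold_energy(seq, blocked=frozenset(), min_loop=3):
    """Stand-in energy: minus the maximum number of base pairs, where
    positions in `blocked` are forced to stay unpaired."""
    n = len(seq)
    if n == 0:
        return 0
    N = [[0] * n for _ in range(n)]           # N[i][j]: max pairs in bases i..j
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = max(N[i + 1][j], N[i][j - 1])
            if (j - i > min_loop and (seq[i], seq[j]) in PAIRS
                    and i not in blocked and j not in blocked):
                best = max(best, N[i + 1][j - 1] + 1)
            for m in range(i + 1, j):         # bifurcation point
                best = max(best, N[i][m] + N[m + 1][j])
            N[i][j] = best
    return -N[0][n - 1]

def accessible_windows(seq, k, delta):
    """Return start positions of windows of size k that cost less than delta to keep open."""
    normal = fold_energy(seq)
    accepted = []
    for start in range(len(seq) - k + 1):
        window = frozenset(range(start, start + k))
        if fold_energy(seq, blocked=window) - normal < delta:
            accepted.append(start)
    return accepted

print(accessible_windows("GGGAAAACCC", 4, 1))
```

Each window triggers a full refolding of the sequence, which is why a faster folding algorithm matters for genome-wide scans.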

Results: Degradation Related Motifs

Results: Tissue Specific microRNAs
Silique: a long, slender, many-seeded, cylindrical fruit of the mustard family

The End

Works Cited
CB Do, DA Woods, S Batzoglou. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 22(14): e90-e98.
Y Wexler, C Zilberstein, M Ziv-Ukelson. A Study of Accessible Motifs and RNA Folding Complexity. Recomb 2006, LNBI 3909, 2006.