CS5263 Bioinformatics RNA Secondary Structure Prediction.


Outline Biological roles for RNA RNA secondary structure –What’s “secondary structure”? –How is it represented? –Why is it important? How to predict?

Central dogma The flow of genetic information: DNA → RNA (transcription) → Protein (translation); DNA → DNA (replication)

Classical Roles for RNA mRNA tRNA rRNA Ribosome

“Semi-classical” RNA –snRNA: small nuclear RNA (60-300 nt), involved in splicing (removing introns), etc. –RNase P: tRNA processing (~300 nt) –SRP RNA: signal recognition particle RNA, membrane targeting (~ nt) –tmRNA: resets stalled ribosomes, destroys aberrant mRNA –Telomerase ( nt) –snoRNA: small nucleolar RNA (many varieties; nt)

Non-coding RNAs Dramatic discoveries in the last 10 years: 100s of new families. Many roles: regulation, transport, stability, catalysis, … siRNA (small interfering RNA; Nobel Prize 2006) and miRNA are both ~21-23 nt –Regulate gene expression –Evidence of disease association. About 1% of DNA codes for protein, but about 30% of it is transcribed into RNA, i.e. ncRNA >> mRNA

Take-home message RNAs play many important roles in the cell beyond the classical roles –Many of which yet to be discovered RNA functions are determined by structures

Example: Riboswitch Riboswitch: an mRNA regulates its own activity

RNA structure Primary: sequence Secondary: base-pairing Tertiary: 3D shape

RNA base-pairing Watson-Crick pairing –C-G ~3 kcal/mol –A-U ~2 kcal/mol “Wobble pair” –G-U ~1 kcal/mol Non-canonical pairs

tRNA structure

Secondary structure prediction Given: a sequence such as CAUUUGUGUACCU… Goal: its secondary structure. How can we compute that?

Terminology Hairpin loops, stems, bulge loops, interior loops, multi-branched loops

Pseudoknot Makes structure prediction hard. Not considered in most algorithms. Example sequence (5’ to 3’): ucgacuguaaaaaagcgggcgacuuucagucgcucuuuuugucgcgcgc

The Nussinov algorithm Goal: maximize the number of base-pairs. Idea: dynamic programming –Loop matching –Nussinov, Pieczenik, Griggs, Kleitman ’78. Too simple for accurate prediction, but a stepping-stone for later algorithms

The Nussinov algorithm Problem: Find the RNA structure with the maximum (weighted) number of nested pairings. Nested: no pseudoknots. Example sequence: ACCACGCUUAAGACACCUAGCUUGUGUCCUGGAGGUCUAUAAGUCAGACCGCGAGAGGGAAGACUCGUAUAAGCG [Figure: a nested secondary structure drawn for this sequence]

The Nussinov algorithm Given sequence $X = x_1 \ldots x_N$, define the DP matrix: $F(i, j)$ = maximum number of base-pairs if $x_i \ldots x_j$ folds optimally –Matrix is symmetric, so let $i < j$

The Nussinov algorithm Can be summarized into two cases: –(i, j) paired with each other: optimal score is $1 + F(i+1, j-1)$ –(i, j) not paired with each other: the interval splits at some k, and the optimal score is $\max_{i \le k < j} \left[ F(i, k) + F(k+1, j) \right]$

The Nussinov algorithm
$$F(i, i) = 0, \qquad F(i, j) = \max \begin{cases} F(i+1, j-1) + S(x_i, x_j) \\ \max_{i \le k < j} \left[ F(i, k) + F(k+1, j) \right] \end{cases}$$
$S(x_i, x_j) = 1$ if $x_i, x_j$ can form a base-pair, and 0 otherwise –Generalize: S(A, U) = 2, S(C, G) = 3, S(G, U) = 1 –Or other types of scores (later). $F(1, N)$ gives the optimal score for the whole sequence

How to fill in the DP matrix?
$$F(i, j) = \max \begin{cases} F(i+1, j-1) + S(x_i, x_j) \\ \max_{i \le k < j} \left[ F(i, k) + F(k+1, j) \right] \end{cases}$$
Cell (i, j) depends on cell (i+1, j–1), on the cells to its left in row i, and on the cells below it in column j, so the matrix is filled one diagonal at a time: j – i = 1, then j – i = 2, j – i = 3, …, up to j – i = N – 1.

Minimum loop length Sharp turns are unlikely. Let the minimum length of a hairpin loop be 1 (3 in real predictions): F(i, i+1) = 0

Algorithm
Initialization:
  F(i, i) = 0; for i = 1 to N
  F(i, i+1) = 0; for i = 1 to N-1
Iteration:
  For L = 2 to N-1
    For i = 1 to N – L
      j = i + L
      F(i, j) = max{ F(i+1, j-1) + S(x_i, x_j), max over i ≤ k < j of [F(i, k) + F(k+1, j)] }
Termination: Best score is given by F(1, N) (For trace back, refer to the Durbin book)

Complexity
  For L = 2 to N-1
    For i = 1 to N – L
      j = i + L
      F(i, j) = max{ F(i+1, j-1) + S(x_i, x_j), max over i ≤ k < j of [F(i, k) + F(k+1, j)] }
Time complexity: O(N^3), since there are O(N^2) cells and each takes O(N) work for the max over k. Memory: O(N^2)

Example RNA sequence: GGGAAAUCC Only count # of base-pairs –A-U = 1 –G-C = 1 –G-U = 1 Minimum hairpin loop length = 1
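
A minimal Python sketch of the Nussinov recurrence applied to this example, assuming 0-based indices and the scoring stated above (A-U, G-C and G-U each count 1). The names `pair_score`, `nussinov` and `dot_bracket` are illustrative and do not come from the slides; the traceback returns one optimal structure, although several may exist.

```python
# Pairs that score 1 in the example (A-U, G-C, G-U), in both orientations.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def pair_score(a, b):
    """S(a, b): 1 if a and b can form a base pair, otherwise 0."""
    return 1 if (a, b) in PAIRS else 0


def nussinov(seq, min_loop=1):
    """Return (best score, sorted list of 0-based pairs) for seq."""
    n = len(seq)
    # F[i][j] holds the best score for seq[i..j]; cells with
    # j - i <= min_loop are never updated and stay 0 (initialization).
    F = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):  # span = j - i, one diagonal at a time
        for i in range(n - span):
            j = i + span
            best = F[i + 1][j - 1] + pair_score(seq[i], seq[j])  # i pairs with j
            for k in range(i, j):                                # split at k
                best = max(best, F[i][k] + F[k + 1][j])
            F[i][j] = best

    # Traceback with an explicit stack of intervals still to be explained.
    pairs, stack = [], [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue  # too short to contain a pair
        s = pair_score(seq[i], seq[j])
        if s > 0 and F[i][j] == F[i + 1][j - 1] + s:
            pairs.append((i, j))
            stack.append((i + 1, j - 1))
            continue
        for k in range(i, j):
            if F[i][j] == F[i][k] + F[k + 1][j]:
                stack.append((i, k))
                stack.append((k + 1, j))
                break
    score = F[0][n - 1] if n else 0
    return score, sorted(pairs)


def dot_bracket(n, pairs):
    """Render pairs as a dot-bracket string of length n."""
    out = ["."] * n
    for i, j in pairs:
        out[i], out[j] = "(", ")"
    return "".join(out)


score, pairs = nussinov("GGGAAAUCC")
print(score, dot_bracket(9, pairs))  # prints: 3 (((...)))
```

Every allowed pair uses a C or a U, and the sequence has only three of them, so no structure can beat three pairs; the sketch finds the stem G1-C9, G2-C8, G3-U7 (1-based) around the AAA loop.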

[Figure: the DP matrix for GGGAAAUCC, with rows and columns labeled G G G A A A U C C, filled in one diagonal at a time over successive slides, followed by drawings of structures recovered by traceback]

Energy minimization
  For L = 2 to N-1
    For i = 1 to N – L
      j = i + L
      E(i, j) = min{ E(i+1, j-1) + e(x_i, x_j), min over i ≤ k < j of [E(i, k) + E(k+1, j)] }
e(x_i, x_j) is the energy of x_i base-pairing with x_j. Energies are negative values, so we minimize rather than maximize. More complex energy rules: energy depends on neighboring bases
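
The same recurrence with min in place of max can be sketched as follows. The pair energies are an assumption made for illustration: they are the rough base-pairing values quoted earlier (C-G ~3, A-U ~2, G-U ~1 kcal/mol) with a negative sign, and `ENERGY` and `min_energy` are hypothetical names. This ignores stacking and loop energies entirely, so it is not a realistic energy model.

```python
# Assumed pair energies in kcal/mol: negated versions of the approximate
# pairing strengths (C-G ~3, A-U ~2, G-U ~1). Illustrative values only.
ENERGY = {
    ("C", "G"): -3.0, ("G", "C"): -3.0,
    ("A", "U"): -2.0, ("U", "A"): -2.0,
    ("G", "U"): -1.0, ("U", "G"): -1.0,
}


def min_energy(seq, min_loop=1):
    """Minimum total pair energy over nested structures of seq."""
    n = len(seq)
    # E[i][j] = lowest energy for seq[i..j]; short intervals stay at 0.0.
    E = [[0.0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):  # span = j - i
        for i in range(n - span):
            j = i + span
            # Case 1: i pairs with j (energy 0.0 if they cannot pair).
            best = E[i + 1][j - 1] + ENERGY.get((seq[i], seq[j]), 0.0)
            # Case 2: split the interval at k.
            for k in range(i, j):
                best = min(best, E[i][k] + E[k + 1][j])
            E[i][j] = best
    return E[0][n - 1] if n else 0.0


print(min_energy("GGGAAAUCC"))              # prints: -8.0
print(min_energy("GGGAAAUCC", min_loop=3))  # prints: -7.0
```

With a minimum loop of 1, the lowest-energy structure pairs U with a nearby A (two G-C pairs plus one A-U pair, -8), which differs from the stem that maximizes the plain pair count through G-U; with a minimum loop of 3 that A-U pair is no longer allowed and the G-U pair returns (-7).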

More realistic energy rules [Figure: an example hairpin with a 1 nt bulge, annotated with energy contributions in kcal/mol] 4 nt hairpin; terminal mismatch of hairpin; -2.9, stacking; -2.9, stacking (special for 1 nt bulge); -1.8, stack; -0.9, stack; -1.8, stack; -2.1, stack; 5’-dangle, -0.3; unstructured, 0; 1 nt bulge, +3.3. Overall ΔG = -4.6 kcal/mol. Complete energy rules at

The Zuker algorithm – main ideas
1. Instead of base pairs, score pairs of base pairs (more accurate)
2. Separate score for bulges
3. Separate scores for loops of different size and composition
4. Separate score for interactions between the stem and the beginning of the loop
5. Use additional matrices to remember the current state, e.g. to model stacking energy:
   W(i, j): energy of the best structure on i, j
   V(i, j): energy of the best structure on i, j given that i, j are paired
Similar to affine-gap alignment.

Two popular implementations mfold (Zuker) RNAfold in the Vienna package (Hofacker)

Accuracy 50-70% for sequences up to 300 nt. Not perfect, but useful. Possible reasons: –Energy rules are not perfect: 5-10% error –Many alternative structures fall within this error range –Alternative structures do exist –Structure may change in the presence of other molecules

Comparative structure prediction To maintain structure, two nucleotides that form a base-pair tend to mutate together. Given K homologous aligned RNA sequences:
Human  aagacuucggaucuggcgacaccc
Mouse  uacacuucggaugacaccaaagug
Worm   aggucuucggcacgggcaccauuc
Fly    ccaacuucggauuuugcuaccaua
Orc    aagccuucggagcgggcguaacuc
If the i-th and j-th positions covary so that they can always form a base pair, then they are likely to be paired

Mutual information $f_{ab}(i, j)$: frequency with which bases a, b occur together in positions i, j. $f_a(i)$: frequency of base a in position i.
aagacuucggaucuggcgacaccc
uacacuucggaugacaccaaagug
aggucuucggcacgggcaccauuc
ccaacuucggauuuugcuaccaua
aagccuucggagcgggcguaacuc
$f_{gc}(3,13) = 3/5$, $f_{cg}(3,13) = 1/5$, $f_{au}(3,13) = 1/5$; $f_g(3) = 3/5$, $f_c(3) = 1/5$, $f_a(3) = 1/5$; $f_c(13) = 3/5$, $f_g(13) = 1/5$, $f_u(13) = 1/5$

Mutual information Also called covariance score. M is high if base a in position i is always accompanied by base b in position j –Does not require a to base-pair with b –Advantage: can detect non-canonical base-pairs. However, M = 0 if there is no mutation at all, even if the two positions form perfect base-pairs. One way around this is to combine covariance and energy scores
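
The transcript does not show the formula for M, so the sketch below assumes the standard mutual-information definition, $M_{ij} = \sum_{a,b} f_{ab}(i,j) \log_2 \frac{f_{ab}(i,j)}{f_a(i)\, f_b(j)}$, and applies it to the five aligned sequences from the slides. The names `ALIGNMENT` and `mutual_information` are illustrative; columns are 1-based to match the f(3,13) notation.

```python
from collections import Counter
from math import log2

# The five aligned sequences shown on the slides.
ALIGNMENT = [
    "aagacuucggaucuggcgacaccc",
    "uacacuucggaugacaccaaagug",
    "aggucuucggcacgggcaccauuc",
    "ccaacuucggauuuugcuaccaua",
    "aagccuucggagcgggcguaacuc",
]


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between 1-based columns i and j."""
    n = len(alignment)
    col_i = [seq[i - 1] for seq in alignment]
    col_j = [seq[j - 1] for seq in alignment]
    count_i = Counter(col_i)               # n * f_a(i)
    count_j = Counter(col_j)               # n * f_b(j)
    count_ij = Counter(zip(col_i, col_j))  # n * f_ab(i, j)
    m = 0.0
    for (a, b), c in count_ij.items():
        f_ab = c / n
        m += f_ab * log2(f_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return m


print(round(mutual_information(ALIGNMENT, 3, 13), 3))  # prints: 1.371
print(mutual_information(ALIGNMENT, 5, 6))             # prints: 0.0
```

Columns 3 and 13 covary perfectly (g-c, c-g, a-u), so M equals the entropy of column 3, about 1.371 bits. Columns 5 and 6 are fully conserved, so M is 0 there, which is the limitation noted above.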

Comparative structure prediction Given a multiple alignment, can infer structure that maximizes the sum of mutual information, by DP However, alignment is hard, since structure often more important than sequence

Comparative structure prediction In practice:
1. Get multiple alignment
2. Find covarying bases – deduce structure
3. Improve multiple alignment (by hand)
4. Go to 2
A manual EM process!

Comparative structure prediction Align then fold Fold then align Align and fold

Context-free Grammar for RNA Secondary Structure S = SS | aSu | cSg | uSa | gSc | L; L = aL | cL | gL | uL | ε [Figure: parse tree deriving an example RNA sequence and its structure from S]

Stochastic Context-free Grammar (SCFG) Probabilistic context-free grammar. Probabilities can be converted into weights. CFG vs SCFG is similar to regular grammar (RG) vs HMM. Grammar: S = SS; S = aSu | uSa; S = cSg | gSc; S = uSg | gSu; S = L; L = aL | cL | gL | uL | ε
$$S(i, j) = \max \begin{cases} e(x_i, x_j) + S(i+1, j-1) \\ L(i, j) \\ \max_k \left[ S(i, k) + S(k+1, j) \right] \end{cases} \qquad L(i, j) = 0$$

SCFG Decoding Decoding: given a grammar (SCFG/HMM) and a sequence, find the best parse (highest probability or score) –Cocke-Younger-Kasami (CYK) algorithm (analogous to Viterbi in HMM) –The Nussinov and Zuker algorithms are essentially special cases of CYK –CYK and SCFG are also used in other domains (NLP, Compiler, etc).

SCFG Evaluation Given a sequence and an SCFG model –Estimate P(seq is generated by model), summing over all possible parses (analogous to the forward algorithm in HMMs). Inside-outside algorithm –Analogous to forward-backward –Inside: bottom-up parsing, $P(x_i \ldots x_j)$ –Outside: top-down parsing, $P(x_1 \ldots x_{i-1}, x_{j+1} \ldots x_N)$. Can calculate base-pairing probabilities –Analogous to posterior decoding –Essentially the same idea is implemented in the Vienna RNAfold package

SCFG Learning Covariance model: similar to profile HMMs –Given a set of sequences with common structures, simultaneously learn SCFG parameters and optimally parse sequences into states –EM on SCFG –Inside-outside algorithm –Efficiency is a bottleneck Have been successfully applied to predict tRNA genes and structures –tRNAScan

Summary: SCFG and HMM algorithms
GOAL               HMM algorithm   SCFG algorithm
Optimal parse      Viterbi         CYK
Estimation         Forward         Inside
                   Backward        Outside
Learning           EM: Fw/Bck      EM: Ins/Outs
Memory complexity  O(N K)          O(N^2 K)
Time complexity    O(N K^2)        O(N^3 K^3)
Where K: # of states in the HMM; # of nonterminal symbols in the SCFG

Open research problems ncRNA gene prediction ncRNA regulatory networks Structure prediction –Secondary, including pseudoknots –Tertiary Structural comparison tools –Structural alignment Structure search tools –“RNA-BLAST” Structural motif finding –“RNA-MEME”