The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews.

Uzilov, Keegan, Mathews. BMC Bioinformatics In Press.

Outline: •Background in ncRNA. •Basic hypothesis. •The Dynalign algorithm for prediction of an RNA secondary structure common to two sequences. •Using Dynalign to find ncRNA sequences in genomes. •Optimizing Dynalign performance.

Central Dogma of Biology:

RNA is an Active Player:

What is ncRNA? •Non-coding RNA (ncRNA) is an RNA that functions without being translated to a protein. •Known roles for ncRNAs: –RNA catalyzes excision/ligation in introns. –RNA catalyzes the maturation of tRNA. –RNA catalyzes peptide bond formation. –RNA is a required subunit in telomerase. –RNA plays roles in immunity and development (RNAi). –RNA plays a role in dosage compensation. –RNA plays a role in carbon storage. –RNA is a major subunit in the SRP, which is important in protein trafficking. –RNA guides RNA modification.

Predicting RNA Secondary and 3D Structure from Sequence:
AAUUGCGGGAAAGGGGUCAA CAGCCGUUCAGUACCAAGUC UCAGGGGAAACUUUGAGAUG GCCUUGCAAAGGGUAUGGUA AUAAGCUGACGGACAUGGUC CUAACCACGCAGCCAAGUCC UAAGUCAACAGAUCUUCUGU UGAUAUGGAUGCAGUUCA
Cate, et al. (Cech & Doudna). (1996) Science 273:1678. Waring & Davies. (1984) Gene 28: 277.

An RNA Secondary Structure: On average, 46 % of nucleotides are unpaired. R2 Retrotransposon 3’ UTR from D. melanogaster. RNA 3:1-16.

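Gibbs Free Energy ($\Delta G^\circ$): $K_i = \frac{[\text{Conformation } i]}{[\text{Unfolded}]} = e^{-\Delta G^\circ_i / RT}$; $\frac{K_i}{K_j} = \frac{[\text{Conformation } i]}{[\text{Conformation } j]} = e^{-(\Delta G^\circ_i - \Delta G^\circ_j)/RT}$. $\Delta G^\circ$ quantifies the favorability of a structure at a given temperature.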
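
As a hedged worked example of how a free energy difference translates into relative populations, the sketch below applies the standard thermodynamic relation $K_i/K_j = e^{-(\Delta G^\circ_i - \Delta G^\circ_j)/RT}$ at 37 °C. The function name `population_ratio` and the example energies are illustrative assumptions, not values from the talk.

```python
import math

R_KCAL = 1.987204e-3   # gas constant in kcal/(mol*K)
T_37C = 310.15         # 37 degrees C expressed in kelvin

def population_ratio(dg_i, dg_j, temperature=T_37C):
    """Return K_i/K_j for two conformations with free energies in kcal/mol."""
    return math.exp(-(dg_i - dg_j) / (R_KCAL * temperature))

# Hypothetical energies: conformation i is 1.4 kcal/mol more stable than j.
ratio = population_ratio(-21.4, -20.0)
print(f"conformation i is about {ratio:.1f} times more populated than j")
```

Under these assumed numbers, a difference of roughly 1.4 kcal/mol corresponds to about a tenfold difference in population.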

Nearest Neighbor Model for RNA Secondary Structure Free Energy at 37 °C: Mathews, Disney, Childs, Schroeder, Zuker, & Turner PNAS 101: 7287.

How is the Lowest Free Energy Structure Determined? •A naïve approach would be to calculate the free energy of every possible secondary structure. •Number of secondary structures ≈ $1.8^N$ (where N is the number of nucleotides). •The free energies of 1000 structures can be calculated in 1 second. •For a 100 nucleotide sequence: –Number of secondary structures ≈ $3 \times 10^{25}$ –Time to calculate ≈ $10^{15}$ years
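
The arithmetic behind this estimate can be checked directly. The sketch below takes the growth rate (1.8 per nucleotide), the sequence length (100), and the evaluation rate (1000 structures per second) from the slide; the function names and the 365.25-day year are assumptions made for illustration.

```python
GROWTH = 1.8                          # structures grow roughly as 1.8**N
RATE_PER_SECOND = 1000                # structures evaluated per second
SECONDS_PER_YEAR = 365.25 * 24 * 3600 # assumed length of a year

def count_structures(n, growth=GROWTH):
    """Approximate number of secondary structures for n nucleotides."""
    return growth ** n

def years_to_enumerate(n, rate=RATE_PER_SECOND):
    """Years needed to evaluate every structure at the given rate."""
    return count_structures(n) / rate / SECONDS_PER_YEAR

structures = count_structures(100)
years = years_to_enumerate(100)
print(f"{structures:.1e} structures, {years:.1e} years")
```

The result is on the order of $10^{25}$ structures and $10^{15}$ years, which is why exhaustive enumeration is not practical.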

Dynamic Programming Algorithm: •Not to be confused with molecular dynamics. •This is a calculation – not a simulation. •The lowest free energy structure is guaranteed given the nearest neighbor parameters used. •Reviewed by Sean Eddy. Nature Biotechnology : 1457.

Dynamic Programming Algorithm: •Named by Richard Bellman in •Applies to calculations in which the cost/score is built progressively from smaller solutions. •Other applications –Sequence alignment –Determining partition functions for RNA secondary structures –Finding shortest paths –Determining moves in games –Linguistics

Dynamic Programming: •Recursion is used to speed the calculation. –The problem is divided into smaller problems. –The smaller problems are used to solve bigger problems. •Two Step Process –Fill – determines the lowest free energy folding possible for each subsequence –Traceback – determines the structure that has the lowest free energy
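
The fill and traceback steps can be illustrated with a deliberately simplified stand-in. The sketch below maximizes the number of base pairs (a Nussinov-style recursion) rather than minimizing nearest neighbor free energy, so it is not the algorithm used in the talk; the function names and the minimum hairpin loop size of 3 are assumptions.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def can_pair(a, b):
    return (a, b) in PAIRS

def fill(seq, min_loop=3):
    """Fill step: best[i][j] is the maximum pair count for seq[i..j]."""
    n = len(seq)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):          # short subsequences first
        for i in range(n - span):
            j = i + span
            score = max(best[i + 1][j], best[i][j - 1])   # i or j unpaired
            if can_pair(seq[i], seq[j]):
                score = max(score, best[i + 1][j - 1] + 1)  # i pairs with j
            for k in range(i + 1, j):                       # two sub-structures
                score = max(score, best[i][k] + best[k + 1][j])
            best[i][j] = score
    return best

def traceback(seq, best, min_loop=3):
    """Traceback step: recover one structure that achieves best[0][n-1]."""
    pairs, stack = [], [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue
        if best[i][j] == best[i + 1][j]:
            stack.append((i + 1, j))
        elif best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
        elif can_pair(seq[i], seq[j]) and best[i][j] == best[i + 1][j - 1] + 1:
            pairs.append((i, j))
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):
                if best[i][j] == best[i][k] + best[k + 1][j]:
                    stack.extend([(i, k), (k + 1, j)])
                    break
    return sorted(pairs)

seq = "GGGAAAUCCC"   # toy hairpin, chosen for illustration
table = fill(seq)
structure = traceback(seq, table)
print(table[0][-1], structure)
```

The same two-step shape applies when the score is a free energy: only the per-cell recursion changes.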

RNA Secondary Structure Prediction Accuracy: Percentage of Known Base Pairs Correctly Predicted: Mathews, Disney, Childs, Schroeder, Zuker, & Turner PNAS 101: 7287.

Pseudoknot: i < i’ < j < j’
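
A minimal sketch of how the crossing condition i < i’ < j < j’ can be tested on a list of base pairs. The function name `has_pseudoknot` and the representation of pairs as index tuples are assumptions for illustration.

```python
def has_pseudoknot(pairs):
    """Return True if any two pairs (i, j) and (i2, j2) satisfy i < i2 < j < j2."""
    ordered = sorted((min(p), max(p)) for p in pairs)
    for index, (i, j) in enumerate(ordered):
        for i2, j2 in ordered[index + 1:]:
            if i < i2 < j < j2:
                return True
    return False

nested = [(1, 10), (2, 9), (3, 8)]     # a simple helix: no crossing pairs
crossing = [(1, 10), (5, 15)]          # 1 < 5 < 10 < 15: non-nested
print(has_pseudoknot(nested), has_pseudoknot(crossing))
```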

Hypothesis: •ncRNAs have lower folding free energy change than non-structural sequences, e.g. mRNA, or random sequences. •Corollary: –ncRNAs, which are structured, can be found in genomic sequences because they have folding free energy change lower than background sequences.

Do Structural RNAs have Lower Folding Free Energy Change than Background? •Yes: –Le et al NAR 18:1613. –Seffens & Digby NAR 27:1578. –Clote et al RNA 11:578. •No: –Workman & Krogh NAR 27:418. –Rivas & Eddy Bioinformatics 16:583.

Test of Hypothesis: •Positive: ncRNA (tRNA or 5S rRNA), with 100 control sequences (first-order Markov chain that preserves dinucleotide frequencies). •Negative: sequence from a first-order Markov chain that preserves dinucleotide frequencies, with 100 control sequences (first-order Markov chain that preserves dinucleotide frequencies).
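
One way such control sequences could be generated is sketched below: transition counts are estimated from the input sequence and a new sequence of the same length is sampled from them, so dinucleotide frequencies are preserved on average. This is an assumed reading of "first-order Markov chain"; the function names and the dead-end fallback are hypothetical, and the exact procedure used in the study may differ.

```python
import random
from collections import defaultdict

def transition_counts(seq):
    """Count how often each nucleotide is followed by each other nucleotide."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(seq, seq[1:]):
        counts[a][b] += 1
    return counts

def markov_control(seq, rng):
    """Sample a same-length sequence from the first-order chain fitted to seq."""
    counts = transition_counts(seq)
    out = [seq[0]]
    while len(out) < len(seq):
        followers = counts.get(out[-1])
        if not followers:
            # Assumed fallback: the nucleotide was only seen at the very end.
            out.append(rng.choice(seq))
        else:
            letters = list(followers)
            weights = [followers[c] for c in letters]
            out.append(rng.choices(letters, weights)[0])
    return "".join(out)

rng = random.Random(0)
original = "GCGACCGGGGCUGGCUUGGUAAUGG"
controls = [markov_control(original, rng) for _ in range(100)]
print(controls[0])
```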

Calculate Z Score of Folding Free Energy Change for Positives and Negatives: •Calculate the mean, $\mu$, and standard deviation, $\sigma$, for the controls. •Z score is the number of standard deviations by which a negative’s or positive’s free energy change differs from the mean: $Z = (\Delta G^\circ_{37} - \mu)/\sigma$ •Choose a Z-score cutoff for classification as ncRNA.
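
The Z-score classification described above can be sketched as follows. The free energies are made-up numbers, the cutoff of -2.0 is an arbitrary illustration, and the use of the sample standard deviation is an assumption; none of these come from the talk.

```python
from statistics import mean, stdev

def z_score(dg, control_dgs):
    """Standard deviations by which dg differs from the mean of the controls."""
    mu = mean(control_dgs)
    sigma = stdev(control_dgs)      # sample standard deviation (assumption)
    return (dg - mu) / sigma

def is_ncrna(dg, control_dgs, cutoff=-2.0):
    """Classify as ncRNA when the folding free energy is unusually low."""
    return z_score(dg, control_dgs) <= cutoff

# Hypothetical folding free energies in kcal/mol.
control_energies = [-20.0, -22.0, -18.0, -21.0, -19.0]
print(z_score(-28.0, control_energies), is_ncrna(-28.0, control_energies))
```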

Scoring: •Sensitivity = (True Positives)/(True Positives + False Negatives) = percent of ncRNA correctly classified as ncRNA •Specificity = (True Negatives)/(True Negatives + False Positives) = percent of non-ncRNA correctly classified as non-ncRNA •Classification outcomes: –Sequence is predicted to be ncRNA: True Positive if the sequence is ncRNA; False Positive if the sequence is not ncRNA. –Sequence is predicted to not be ncRNA: False Negative if the sequence is ncRNA; True Negative if the sequence is not ncRNA.
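
The two definitions above can be computed from paired truth and prediction labels. This is a minimal sketch; the function names and the example labels are invented for illustration.

```python
def confusion(truth, predicted):
    """Return (TP, FP, FN, TN) for boolean truth and prediction lists."""
    tp = sum(t and p for t, p in zip(truth, predicted))
    fp = sum((not t) and p for t, p in zip(truth, predicted))
    fn = sum(t and (not p) for t, p in zip(truth, predicted))
    tn = sum((not t) and (not p) for t, p in zip(truth, predicted))
    return tp, fp, fn, tn

def sensitivity(tp, fn):
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)

# Hypothetical labels: True means "is ncRNA" or "predicted ncRNA".
truth = [True, True, False, False, False]
predicted = [True, False, False, False, True]
tp, fp, fn, tn = confusion(truth, predicted)
print(sensitivity(tp, fn), specificity(tn, fp))
```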

Distribution of Z Scores: Count

Receiver Operating Characteristic (ROC) Curve:
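
A ROC curve of this kind can be traced by sweeping the Z-score cutoff and recording sensitivity against (1 - specificity) at each step; the integral of the curve then summarizes classifier quality. The sketch below assumes that lower Z scores indicate ncRNA, and the Z scores themselves are invented.

```python
def roc_points(positive_z, negative_z):
    """Return (false positive rate, true positive rate) points over all cutoffs."""
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(positive_z) | set(negative_z)):
        tpr = sum(z <= cutoff for z in positive_z) / len(positive_z)
        fpr = sum(z <= cutoff for z in negative_z) / len(negative_z)
        points.append((fpr, tpr))
    return points

def area_under_curve(points):
    """Integrate the ROC curve with the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical Z scores for ncRNAs and for negatives.
separated = area_under_curve(roc_points([-5.0, -4.0, -3.0], [-1.0, 0.0, 1.0]))
overlapping = area_under_curve(roc_points([0.0, 1.0], [0.0, 1.0]))
print(separated, overlapping)
```

An area near 1 means the two distributions are well separated; an area near 0.5 means the score carries little information.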

Why do Structural RNA Sequences Not Have a Significantly Lower Folding Free Energy Change? •Hypothesis is incorrect. •Secondary structure prediction has limited accuracy: –Kinetics may play a role in folding. –Nearest neighbor free energy parameters are based on a limited number of experiments and have error. –The algorithms that are used for these studies cannot predict pseudoknots (non-nested pairs).

Dynalign (a 4-D Dynamic Programming Algorithm): •Combines an algorithm for secondary structure prediction (2-D dynamic programming algorithm) with an algorithm for sequence alignment (2-D dynamic programming algorithm). •Simultaneously finds the sequence alignment and thermodynamically favorable common secondary structure for two sequences. •Dynalign requires no sequence identity. Mathews & Turner. Journal of Molecular Biology. 317: (2002). Mathews. Bioinformatics. 21: (2005).

Inputs, Optimization, and Outputs: •Input: Sequence 1, Sequence 2. •Optimization (minimize $\Delta G^\circ_{\text{total}}$): $\Delta G^\circ_{\text{total}} = \Delta G^\circ_{\text{sequence 1}} + \Delta G^\circ_{\text{sequence 2}} + (\Delta G^\circ_{\text{gap}})(\text{number of gaps})$ •Output: Sequence Alignment, Structure of 1, Structure of 2, where each helix in 1 must be homologous to a BP in 2.
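
The objective being minimized can be written out as a small function. This sketch only evaluates the sum for a given alignment; it does not perform the optimization. Counting every "-" character as one gap, the function name, and the example energies and gap penalty are all assumptions.

```python
def total_free_energy(dg_seq1, dg_seq2, aligned1, aligned2, dg_gap):
    """Sum the two folding free energies and a per-gap penalty (kcal/mol)."""
    gaps = aligned1.count("-") + aligned2.count("-")   # assumed gap definition
    return dg_seq1 + dg_seq2 + dg_gap * gaps

# Hypothetical values: two folding energies and a 0.5 kcal/mol gap penalty.
total = total_free_energy(-10.0, -12.0, "GC-A", "GCUA", 0.5)
print(total)
```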

Optimization of $\Delta G^\circ_{\text{gap}}$: Seven 5S rRNAs with secondary structures predicted with 47.8% average accuracy. Average of all 42 pair-wise combinations predicted by Dynalign.

Improving the Accuracy of tRNA Secondary Structure Prediction: RD0260 RE6781 Conventional Free Energy Minimization Predicted Structures:

Improving the Accuracy of tRNA Secondary Structure Prediction: Dynalign Predicted Structures: RD0260 RE6781
RD0260 GCGACCGGGGCUGGCUUGGUAAUGGUACUCCCCUGUCACGGGAGAGAAUGUGGGUUCAAAUCCCAUCGGUCGCGCCA
RE6781 UCCGUCGUAGUCUAGGUGGUUAGGAUACUCGGCUCUCACCCGAGAGAC-CCGGGUUCGAGUCCCGGCGACGGAACCA
^^^^^^^ ^^^^ ^^^^ ^^^^^ ^^^^^ ^^^^^ ^^^^^^^^^^^^

Benchmarks: •Four databases: –All pairwise comparisons (21) of seven 5S sequences with widely varying accuracy of secondary structure prediction using a single sequence. –3 calculations with 6 SRP sequences. –All pairwise calculations (780) with 40 randomly chosen tRNA sequences. –All pairwise comparisons (105) of 15 randomly chosen 5S rRNA sequences.

Sensitivity: Sensitivity = (Correctly Predicted Pairs)/(Total Known Pairs)
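
A minimal sketch of this base-pair sensitivity, assuming that a predicted pair counts as correct only when it exactly matches a known pair (benchmarks sometimes use a looser matching rule). The function name and example pairs are hypothetical.

```python
def pair_sensitivity(predicted, known):
    """Fraction of known base pairs that also appear in the prediction."""
    known_set = {tuple(sorted(p)) for p in known}
    predicted_set = {tuple(sorted(p)) for p in predicted}
    if not known_set:
        raise ValueError("at least one known pair is required")
    return len(known_set & predicted_set) / len(known_set)

known_pairs = [(1, 10), (2, 9), (3, 8), (4, 7)]
predicted_pairs = [(1, 10), (9, 2), (5, 12)]   # (9, 2) is the same pair as (2, 9)
print(pair_sensitivity(predicted_pairs, known_pairs))
```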

Improving Dynalign Performance: •The original restriction on the alignments is: $|i - k| \le M$ –For the 3’ ends of the sequences to align: $M \ge |N_1 - N_2|$ –For most applications, the ends of the sequences should align. •This suggests an alternative restriction: $|i N_2 / N_1 - k| \le M$ –This allows a smaller M parameter. Calculation time scales as $O(N^3 M^3)$.
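
The effect of the two restrictions can be seen by testing whether the 3’ ends (i = N1, k = N2) are allowed to align. The sketch below uses invented lengths of 100 and 80 nucleotides; the function names are assumptions.

```python
def allowed_original(i, k, m):
    """Original restriction: |i - k| <= M."""
    return abs(i - k) <= m

def allowed_revised(i, k, n1, n2, m):
    """Revised restriction: |i * N2 / N1 - k| <= M."""
    return abs(i * n2 / n1 - k) <= m

n1, n2 = 100, 80   # hypothetical sequence lengths

# Under the original rule the ends align only once M reaches |N1 - N2| = 20.
print(allowed_original(n1, n2, 19), allowed_original(n1, n2, 20))

# Under the revised rule the ends align even with M = 0.
print(allowed_revised(n1, n2, n1, n2, 0))
```

Because the run time grows with the cube of M, the smaller M permitted by the revised rule matters most when the two sequences differ substantially in length.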

Heuristic to Exclude Base Pairs: •There are many possible canonical base pairs that are not worth considering because any structure that contains them has a high free energy. •The “high energy” base pairs can be identified by secondary structure prediction using a single sequence (very fast). The high energy pairs can then be excluded from a Dynalign structure prediction.

% of Known Pairs within a % Energy Increment from the Lowest Free Energy Structure:

Time Performance Improvement: Columns: Sequence 1; Sequence 2; N1; Original M; Original Time (hr:min); Revised M; Revised Time (hr:min). Rows: RD0260, RE :2260:01; H. volcanii 5S, A. globiformis 5S :1160:03; D. takashii R2 3’ UTR, D. melanogaster :0580: GHz Intel Pentium 4 with 1 GB RAM; Red Hat Enterprise Linux 3; gcc compiler

Revised Hypothesis: •Dynalign calculated folding free energies for sequence pairs derived from genome alignments can be used to find ncRNAs with high sensitivity and specificity.

Testing the Hypothesis: •Positive: ncRNA pair (tRNAs or 5S rRNAs), with 20 control sequence pairs (shuffle of global alignment). •Negative: negative pair (shuffle of global alignment), with 20 control sequence pairs (shuffle of global alignment).
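
One plausible reading of "shuffle of global alignment" is a shuffle of alignment columns, which keeps the pairwise identity and gap count fixed while destroying any conserved structure. The sketch below implements that reading only; the actual shuffling procedure used in the study may differ, and the function names and toy alignment are assumptions.

```python
import random

def shuffle_alignment(aligned1, aligned2, rng):
    """Shuffle alignment columns, keeping each column's two characters together."""
    columns = list(zip(aligned1, aligned2))
    rng.shuffle(columns)
    return ("".join(c[0] for c in columns), "".join(c[1] for c in columns))

def identity(aligned1, aligned2):
    """Number of columns where both sequences have the same character."""
    return sum(a == b for a, b in zip(aligned1, aligned2))

rng = random.Random(1)
row1, row2 = "GCGA-CC", "GCUAUCC"          # toy alignment for illustration
control_pairs = [shuffle_alignment(row1, row2, rng) for _ in range(20)]
print(control_pairs[0])
```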

Dynalign ROC Curve has Larger Integral than Single Sequence:

ROC Curves Depend on M:

ROC Curves for tRNA and 5S rRNA:

Comparison to Other State of the Art Methods: •QRNA: –Rivas & Eddy BMC Bioinformatics 2:8. –Comparative analysis of aligned sequences, where compensating base pair changes indicate ncRNA. Classification by stochastic context-free grammar. •RNAz: –Washietl, Hofacker, & Stadler PNAS 102: –Folding free energy of two or more aligned sequences using RNAalifold. Classification by support vector machine (SVM). •Both Methods Use Fixed Alignments: –Faster than Dynalign. –Limited by the sequence alignment algorithm (compensating base pair changes make accurate alignment difficult).

QRNA Sequence Types:

Dynalign vs. RNAz:

What About Low Sequence Identity Pairs?

Human vs. Mouse Alignment (Santa Cruz Genome Server) Pairwise Identities for 50 Nucleotide Windows: [Table: % Identity (binned), Number of Windows, Percent of Windows, with a Total row.]

Faster Method Using Dynalign: •Run a single calculation and use a support vector machine (SVM) to classify sequence as ncRNA or not. –Each window only needs to be scanned once. –A probability is assigned to the classification. •SVM –Trained with tRNA and 5S rRNA sequences. –Input: •Dynalign total free energy change •Length of the shorter sequence •A,C,G content of each sequence
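
The SVM inputs listed above can be assembled into a feature vector as sketched below. Only the feature construction is shown, not the SVM itself. Treating "A,C,G content" as fractions of sequence length, the feature ordering, and the function names are assumptions; the example energy and sequences are invented.

```python
def base_fraction(seq, base):
    """Fraction of positions in seq occupied by the given nucleotide."""
    return seq.count(base) / len(seq)

def svm_features(total_dg, seq1, seq2):
    """Build [total free energy, shorter length, A/C/G fractions of each sequence]."""
    features = [total_dg, min(len(seq1), len(seq2))]
    for seq in (seq1, seq2):
        features.extend(base_fraction(seq, base) for base in "ACG")
    return features

# Hypothetical Dynalign total free energy change and two short sequences.
vector = svm_features(-35.2, "GGCAUU", "GCAU")
print(vector)
```

The U fraction is omitted because it is determined by the other three when only A, C, G, and U are present.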

ROC of SVM vs. 20 Controls:

Dynalign-SVM vs. RNAz at Low Identity:

Unrolling the Method on E. coli: •Look for ncRNA in E. coli using alignments to S. typhi. –MUMmer (Kurtz et al Genome Biol 5:R12) •15,214 blocks of 50 to 150 nucleotides as above (where long alignment blocks were divided into 150 nucleotide windows that overlap 75 nucleotides)
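
The windowing rule (blocks of 50 to 150 nucleotides, with longer blocks cut into 150-nucleotide windows that overlap by 75) can be sketched as follows. How the final partial window was handled is not stated, so placing the last window flush with the end of the block is an assumption, as are the function name and the choice to skip blocks shorter than 50.

```python
def windows(block_length, size=150, overlap=75, min_length=50):
    """Return (start, end) spans covering an alignment block of the given length."""
    if block_length < min_length:
        return []                       # assumed: too short to scan
    if block_length <= size:
        return [(0, block_length)]      # block fits in a single window
    step = size - overlap
    spans = []
    start = 0
    while start + size < block_length:
        spans.append((start, start + size))
        start += step
    # Assumed handling of the tail: a last full-size window ending at the block end.
    spans.append((block_length - size, block_length))
    return spans

print(windows(120))
print(windows(300))
```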

ncRNA Detection: [Table comparing Dynalign, RNAz, and QRNA. Rows: number of known ncRNAs found in E. coli (156 ncRNAs known) and in S. typhi (110 ncRNAs known); number of hits that are not known ncRNAs (likely false positives) in E. coli and in S. typhi.]

Epilogue: Improving Dynalign Performance: •In collaboration with Gaurav Sharma, Electrical and Computer Engineering, University of Rochester, and Arif Harmanci, we pre-determine the sequence alignment probabilities with a Hidden Markov Model. •Then, we only allow alignments in Dynalign that have probability greater than –This removes the need of using the M parameter heuristic. –This does not affect the accuracy of structure prediction by Dynalign.

Benchmarks Against Other Programs Using 2000 Pairs of 5S rRNA Sequences: [Table: percent of known pairs correctly predicted by Dynalign, FOLDALIGN, StemLoc, Consan, and single sequence prediction, by percent sequence identity (20-40, 40-60, 60-80, 80-100, and all).]

Performance Benchmarks Using 200 Pairs of Sequences: [Table: time (s) and memory (MB) for tRNA and 5S rRNA with Dynalign, FOLDALIGN, StemLoc, and Consan.] Using a single core on a dual, dual-core Opteron 270 machine running Fedora Core 5 and gcc

Parallelizing Dynalign for SMP: •In collaboration with Paul Tymann, Computer Science, Rochester Institute of Technology and CS students Chris Connett, Glenn Katzen, Andrew Yohn, we developed an SMP version of Dynalign. •This takes advantage of the fact that there are a number of positions in the arrays that can be filled independently in the dynamic programming algorithm recursions.
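
The independence that makes this parallelization possible can be illustrated on a simpler 2-D table, as used for single-sequence folding: every cell for a subsequence of a given length depends only on cells for shorter subsequences, so all cells of one length can be filled concurrently. This is an analogy only; it does not reproduce the 4-D decomposition used in SMP-Dynalign, and the function names are hypothetical.

```python
def wavefronts(n):
    """Yield groups of (i, j) cells that can be filled independently, shortest first."""
    for span in range(1, n):
        yield [(i, i + span) for i in range(n - span)]

def dependencies(i, j):
    """Cells read when filling (i, j) in a standard 2-D folding recursion."""
    deps = {(i + 1, j), (i, j - 1), (i + 1, j - 1)}
    for k in range(i + 1, j):
        deps.update({(i, k), (k + 1, j)})
    return {(a, b) for a, b in deps if a < b}

waves = list(wavefronts(5))
for wave in waves:
    print(wave)
```

Each printed group could be handed to separate threads or cores, with a synchronization point between groups.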

Scaling: Two R2 3’ UTRs of length 234 and 217 nucleotides. Using a dual, dual-core Opteron 270 machine running Fedora Core 5 and gcc

Preliminary Results with SMP-Dynalign: •Single sequence secondary structure prediction of E. coli 16S rRNA (1542 nucleotides) has 43.6% sensitivity. •E. coli 16S rRNA run on Dynalign with: –B. subtilis 16S rRNA (1552 nucleotides) has 80.7% sensitivity and required 381 minutes on 4 cores and 983 MB of RAM. –Borrelia burgdorferi 16S rRNA (1532 nucleotides) has 76.4% sensitivity and required 408 minutes on 4 cores and 1.0 GB of RAM.

Conclusions: •The folding free energy of single sequences does not provide a sensitive and specific method of finding ncRNAs. It does, however, provide a pre-filtering method that can remove 30% of sequences from consideration. •Dynalign shows promise as a method for ncRNA detection, especially at low pairwise identities of sequences.

Acknowledgements: •Past Lab Members: –Andrew Uzilov –Shan Zhao –Eliany Sanchez-Baez •Lab Members: –Sumeet Chandha –Zhi Lu –Matthew Seetin –Rahul Tyagi –Keith VanNostrand •Funding: –Alfred P. Sloan Foundation –National Institutes of Health •Computing: –CASCI Lab at Rochester Institute of Technology

MUMmer:

WuBLASTn: