Exploiting Conserved Structure for Faster Annotation of Non-Coding RNAs Without Loss of Accuracy
Zasha Weinberg and Walter L. Ruzzo
Presented by: Jeff Bonis, CISC841 - Bioinformatics

What Are Non-Coding RNAs (ncRNAs)?
“Functional molecules that do not code for proteins”
Examples: transfer RNA (tRNA), spliceosomal RNA, microRNA, regulatory RNA elements
Over 100 known ncRNA families

Secondary Structure of ncRNAs
Conserved, therefore useful for identifying homologs
Secondary structure is functionally important to RNAs
Base pairing is important in pattern searching
e.g. 16S rRNA, part of the small subunit of the prokaryotic ribosome

What Techniques Exist?
Two models that predict homologs in ncRNA families:
–Covariance Models (CMs)
–Easy RNA Profile IdentificatioN (ERPIN)
Both use a multiple alignment of family members with secondary structure annotation
A statistical model is built from this multiple alignment
Both display high sensitivity and specificity

What about ERPIN?
A dynamic programming algorithm matches the statistical profile against a target database and returns the solutions and their scores
Cannot take into account non-consensus bulges in helices (caused by indels)
Needs user-specified score thresholds, which compromises accuracy

CMs
“Specify a tree-like SCFG architecture suited for modelling consensus RNA secondary structures”
Cannot accommodate pseudoknots
Very slow algorithm

Which model should be improved?
The Covariance Model (CM) is chosen because its limitation, pseudoknots, carries little information anyway
Goal: address the slow speed without sacrificing accuracy
CMs are used in Rfam:
–8-gigabase genome DB called RFAMSEQ
–Takes over a year to search for tRNAs on a Pentium 4
–Over 100 ncRNA families

Previous improvements on speed
BLAST-based heuristic
–Known members are BLASTed against RFAMSEQ
–The CM is run on the resulting set
BLAST misses family members, especially where there is low sequence conservation
tRNAscan-SE
–Uses 2 heuristic-based programs for tRNA searches
–The CM is used on the resulting set
–May miss tRNAs that CMs would find

How to improve sensitivity?
The authors previously developed rigorous filters with 100% sensitivity relative to the set the CM would find
Filters are based on profile HMMs
–A profile HMM is built from the CM, then run on the DB
–Much of the DB is filtered out; the CM runs on the remaining set
The HMM filter is based on sequence conservation
–Scans for 126 of the 139 ncRNA families in Rfam
–The other 13 display low sequence conservation, but have strong conservation of secondary structure, which an HMM cannot take into account
–Heuristic methods also miss these ncRNAs

How can these special biological situations be accounted for?
The authors propose 3 innovations to overcome these setbacks
Two techniques include secondary structure information in filtering, at the expense of CPU time:
–Sub-CMs: hybrid filtering composed of CMs and profile HMMs
–Store-pair: uses additional HMM states to model key base pairs
A third technique reduces scan time:
–Runs filters in series, quickest first, ending with the most selective
–Formulated as a shortest path problem

Results
The techniques worked for 11 of the 13 previously missed Rfam families
–Also found new hits missed by BLAST
For tRNAscan-SE, provided a rigorous scan for 3 of its 4 CMs, finding missed hits
On average 100 times faster than the raw CM
Uncovers members missed by heuristics

What are CMs anyway?
“Statistical models that can detect when a positional sequence and secondary structure resemble a given multiple RNA alignment”
Described in terms of stochastic context-free grammars (SCFGs)
Transformational grammars
–Rules: productions of the form Si -> xL Si+1 xR, where xL and xR are the left and right nucleotides
–Terminals: symbols in the actual string (nucleotides)
–Non-terminals: abstract symbols (states)
–Parse: series of steps to obtain the final output
Example:
–RNA molecules CAG or GAC
–S1 -> c S2 g | g S2 c; S2 -> a
–Parse: S1 -> c S2 g -> cag
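To make the toy grammar above concrete, here is a minimal Python sketch (the grammar encoding and the expand helper are illustrative, not from the paper) that enumerates every string the two rules can derive:

```python
# Toy SCFG from the slide: S1 -> c S2 g | g S2 c ; S2 -> a
# Non-terminals are dictionary keys; terminals are lowercase nucleotides.
RULES = {
    "S1": [["c", "S2", "g"], ["g", "S2", "c"]],
    "S2": [["a"]],
}

def expand(symbols):
    """Expand the leftmost non-terminal recursively, yielding every
    terminal string (i.e. every complete parse) the grammar derives."""
    for i, sym in enumerate(symbols):
        if sym in RULES:                        # found a non-terminal
            for production in RULES[sym]:
                yield from expand(symbols[:i] + production + symbols[i + 1:])
            return
    yield "".join(symbols)                      # all terminals: a finished parse

print(list(expand(["S1"])))                     # ['cag', 'gac']
```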

How are CMs used?
Each rule is assigned a probability
–Rules more consistent with the family have higher probability
The probability of a parse is the product of the probabilities of the rules it uses
CMs use log-odds ratios and sum the scores instead of multiplying probabilities
The CM Viterbi algorithm requires a window-length input, which upper-bounds the family member's length and affects scan time
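A minimal sketch of log-odds scoring for the toy grammar, assuming made-up rule probabilities and a uniform background model (none of these numbers come from the paper):

```python
import math

# Illustrative rule probabilities (invented for this sketch).
RULE_PROB = {
    ("S1", ("c", "S2", "g")): 0.7,
    ("S1", ("g", "S2", "c")): 0.3,
    ("S2", ("a",)): 1.0,
}
BACKGROUND = 0.25                # uniform null model per emitted nucleotide

def parse_score(parse):
    """Log-odds score of a parse: sum over its rules of
    log2( P(rule) / P(emitted nucleotides under the null model) )."""
    score = 0.0
    for lhs, production in parse:
        emitted = sum(1 for s in production if s.islower())   # terminals only
        score += math.log2(RULE_PROB[(lhs, production)] / BACKGROUND ** emitted)
    return score

# The parse S1 -> c S2 g -> cag from the slide, scored in bits:
cag_parse = [("S1", ("c", "S2", "g")), ("S2", ("a",))]
print(f"{parse_score(cag_parse):.2f} bits")    # ~5.49
```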

How are profile HMMs and CMs combined?
Given a CM, a profile HMM is created whose Viterbi score upper-bounds the CM's Viterbi score
–Guarantees 100% sensitivity relative to the CM
Filtering:
–At each nucleotide position in the subsequences of the database, the HMM is used to compute an upper bound on the CM score
–A CM scan is applied to all subsequences whose upper bound exceeds some threshold
–Subsequences below the threshold are filtered out
Profile HMMs are represented by regular grammars, which cannot emit paired nucleotides, e.g.
–CM: S1 -> a S2 u | c S2 g; S2 -> ε
–HMM: S1L -> a S2L | c S2L; S2L -> S1R; S1R -> u | g
Each CM pair state is expanded into a left HMM state and a right HMM state
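A minimal sketch of the filtering loop described above; hmm_upper_bound and cm_score are hypothetical stand-ins for the real Viterbi scans, and the windowing is simplified:

```python
def rigorous_filter_scan(database, windows, hmm_upper_bound, cm_score, threshold):
    """Run the cheap HMM on every window; run the expensive CM only where the
    HMM score (a provable upper bound on the CM score) reaches the threshold.
    Because bound >= CM score, no hit the CM would report is ever discarded."""
    hits = []
    for start, end in windows:
        subseq = database[start:end]
        if hmm_upper_bound(subseq) >= threshold:    # cheap, rigorous pre-check
            score = cm_score(subseq)                # expensive CM Viterbi scan
            if score >= threshold:
                hits.append((start, end, score))
    return hits
```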

How can these be supplemented?
Selecting an optimal series of filters
–The filtering fraction (fraction of the DB left over) and the run time are estimated by running a filter on a training sequence
–Goal: minimize expected total CPU time
–Assumptions: the estimated fractions and CPU times are constant across sequences, and a filter's fraction is not affected by the previously run filters
The optimal sequence of filters is solved as a shortest-path problem (see the sketch below)
–Nodes are the filters and the CM
–Edge weights are CPU times
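The paper frames filter ordering as a shortest-path problem; since only a handful of filters are involved, the sketch below simply enumerates the orderings under the slide's independence assumptions. All times and fractions are invented for illustration:

```python
from itertools import permutations

# Hypothetical filters: (name, seconds per megabase, filtering fraction).
FILTERS = [("profile HMM", 1.0, 0.02),
           ("store-pair HMM", 3.0, 0.005),
           ("sub-CM", 25.0, 0.001)]
CM_TIME = 2000.0    # seconds per megabase for the raw CM (invented)

def expected_time(series):
    """Expected CPU time per megabase for a series of filters followed by the CM,
    assuming each filter's fraction is independent of the filters run before it."""
    total, surviving = 0.0, 1.0
    for _name, t, fraction in series:
        total += surviving * t           # this filter sees what survived so far
        surviving *= fraction            # only this fraction is passed on
    return total + surviving * CM_TIME   # the CM runs on whatever is left

best = min((series for r in range(len(FILTERS) + 1)
            for series in permutations(FILTERS, r)),
           key=expected_time)
print([name for name, _, _ in best], f"{expected_time(best):.4f} s/Mb")
```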

Sub-CM technique
Exploits information in hairpins (bulges and internal loops)
Much information is stored in short hairpins that need only part of the CM's states
The filter grammar contains both HMM and CM parts
The window length of the sub-CM is crucial
HMMs are created manually after sub-CMs are found
–Automation of this is a future project

Store-pair technique
An HMM with extra states can reflect base pairs: a left state remembers which nucleotide it emitted, so the matching right state can score the pair
Productions inconsistent with the stored nucleotide, e.g. S1L[C] -> g S1L[C], have score negative infinity
5 states are added per HMM state, but this can be reduced
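A minimal sketch of the store-pair idea; the state names, pair scores, and helper functions are illustrative, not the authors' implementation. The left state records the base it emitted, and the right state scores the whole pair against that stored base:

```python
NEG_INF = float("-inf")

# Illustrative pair-emission scores (bits) for one base-paired position.
PAIR_SCORE = {("c", "g"): 3.0, ("g", "c"): 3.0, ("a", "u"): 2.0, ("u", "a"): 2.0}

def left_emit(base):
    """Left state: emit a base and store it in the state label; the pair
    score is deferred until the matching right state emits."""
    return base, 0.0

def right_emit(stored_base, base):
    """Right state: emit the partner base and score the full pair using the
    base remembered by the left state. Combinations the model does not
    allow get a score of negative infinity."""
    return PAIR_SCORE.get((stored_base, base), NEG_INF)

stored, score = left_emit("c")
score += right_emit(stored, "g")    # consistent pair: 3.0 bits
print(score)                        # 3.0
print(right_emit("c", "a"))         # -inf: inconsistent with the stored base
```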