Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop September 16, 2008.

Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop September 16, 2008

Advancing Science with DNA Sequence 1.Introduction 2.Tools out there 3.Basic principles behind tools 4.Known problems of the tools: why you may need manual curation Outline

Advancing Science with DNA Sequence 1.I ntroduction (who said annotating prokaryotic genomes is easy?) 2.T ools out there 3.B asic principles behind tools 4.K nown problems of the tools: why you may need manual curation Outline

Advancing Science with DNA Sequence Sequence features in prokaryotic genomes:  stable RNA-coding genes (rRNAs, tRNAs, RNA component of RNaseP, tmRNA)  protein-coding genes (CDSs)  transcriptional features (mRNAs, operons, promoters, terminators, protein-binding sites, DNA bends)  translational features (RBS, regulatory antisense RNAs, mRNA secondary structures, translational recoding and programmed frameshifts, inteins)  pseudogenes (tRNA and protein-coding genes)  … Finding the genes in microbial genomes Well-annotated bacterial genome in Artemis genome viewer: rRNA tRNA operon promoter terminator protein-coding gene CDS protein-binding site

Advancing Science with DNA Sequence 1.Introduction (who said annotating prokaryotic genomes is easy?) 2.Tools out there (don’t bother to write down the names and links, all presentations will be available on the web site) 3.Basic principles behind tools 4.Known problems of the tools: why you may need manual curation Outline

Advancing Science with DNA Sequence IMG-ER http://img.jgi.doe.gov/er IMG-ER submission page: http://durian.jgi-psf.org/~imachen/cgi-bin/Submission/main.cgi RAST http://rast.nmpdr.org/ JCVI Annotation Service http://www.tigr.org/tigr-scripts/AnnotationEngine/ann_engine.cgi Output: rRNAs and tRNAs, CDSs, functional annotations output in several formats Tools out there: servers for microbial genome annotation - I Output: stable RNA-encoding genes, CDSs, functional annotations output in GenBank format Output: CDSs, stable RNAs? functional annotations format?

Advancing Science with DNA Sequence Tools out there: servers for microbial genome annotation - II REGANOR https://www.cebitec.uni-bielefeld.de/groups/brf/software/reganor/ RefSeq http://www.ncbi.nlm.nih.gov/genomes/MICROBES/genemark.cgi http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi EasyGene http://www.cbs.dtu.dk/services/EasyGene/ Output: rRNAs and tRNAs, CDSs, output in gff format Output: CDSs, output in tbl format Output: CDSs, size restriction <1Mb

Advancing Science with DNA Sequence Artemis http://www.sanger.ac.uk/Software/Artemis/ Manatee http://manatee.sourceforge.net/ Argo http://www.broad.mit.edu/annotation/argo/ Major difference: viewer vs editor? Tools out there: genome browsers for manual annotation of microbial genomes Linux versions only; genome needs to be annotated by the JCVI Annotation Service Windows and Linux versions; works with files in many formats, annotated by any pipeline Windows and Linux; works with files in many formats

Advancing Science with DNA Sequence Large structural RNAs (23S and 16S rRNAs) The only known tool is search_for_RNAs script developed by Niels Larsen and available upon request; used by IMG, RAST and REGANOR annotation servers Small structural RNAs (5S rRNA, tRNAs, tmRNA, RNaseP RNA component) Rfam database, INFERNAL search tool http://www.sanger.ac.uk/Software/Rfam/ http://rfam.janelia.org/ http://infernal.janelia.org/ ARAGORN http://130.235.46.10/ARAGORN1.1/HTML/aragorn1.2.html tRNAScan-SE http://lowelab.ucsc.edu/tRNAscan-SE/ Tools out there: tools for finding stable (“non-coding”) RNAs - I Web service: sequence search is limited to 2 kb Web service: sequence search is limited to 15 kb, finds tRNAs and tmRNAs only Web service: sequence search is limited to 5 Mb, finds tRNAs only

Advancing Science with DNA Sequence Short regulatory RNAs (riboswitches, etc.) Rfam database, INFERNAL search tool http://www.sanger.ac.uk/Software/Rfam/ http://rfam.janelia.org/ http://infernal.janelia.org/ Other (less popular) tools: Pipeline for discovering cis-regulatory ncRNA motifs: http://bio.cs.washington.edu/supplements/yzizhen/pipeline/ RNAz http://www.tbi.univie.ac.at/~wash/RNAz/ Tools out there: tools for finding “non-coding” RNAs - II Web service: sequence search is limited to 2 kb; Provides list of pre- calculated RNAs for publicly available genomes

Advancing Science with DNA Sequence Reading frames: translations of the nucleotide sequence with an offset of 0, 1 and 2 nucleotides (three possible translations in each direction) Open reading frame (ORF): reading frame between a start and stop codon Tools out there: finding protein- coding genes (not ORFs!)

Advancing Science with DNA Sequence Tools out there: most popular CDS-finding tools CRITICA Glimmer family (Glimmer2, Glimmer3, RBS finder) http://glimmer.sourceforge.net/ GeneMark family (GeneMark-hmm, GeneMarkS) http://exon.gatech.edu/GeneMark/ EasyGene Combinations and variations of the above REGANOR (CRITICA + Glimmer3 + pre-processing) Old ORNL pipeline (CRITICA + Glimmer3) RAST (Glimmer2 + pre- and post-processing)

Advancing Science with DNA Sequence Tools out there: metagenome annotation BLASTx Fgenesb http://linux1.softberry.com/berry.phtml?topic=fgenesb&group=programs&s ubgroup=gfindb GeneMark (GeneMark-hmm for reads, GeneMarkS for longer contigs) http://exon.gatech.edu/GeneMark/ MetaGene http://metagene.cb.k.u-tokyo.ac.jp/metagene/ GISMO ? http://www.cebitec.uni-bielefeld.de/groups/brf/software/gismo/ Full-service servers IMG/M-ER – uses GeneMark for Sanger, proxygenes for 454 http://img.jgi.doe.gov/submit MG-RAST http://metagenomics.nmpdr.org/

Advancing Science with DNA Sequence 1.Introduction (who said annotating prokaryotic genomes is easy?) 2.Tools out there (don’t bother to write down the names and links, all presentations will be available on the web site) 3.Basic principles behind tools (very basic, see specific papers for details) 4.Known problems of the tools: why you may need manual curation Outline

Advancing Science with DNA Sequence Basic principles: finding RNAs  16S and 23S rRNAs are too large to be represented as statistical models (cannot use secondary structure in prediction)  Approximate location is identified by sequence similarity search (BLASTn) against a database of 16S and 23S rRNA sequences (but GenBank contains too many fragments)  Boundaries of these RNAs are usually defined as the coordinates of a BLAST hit (but alignment is local)  For small RNAs statistical models can be generated  Both sequence similarity AND secondary structure are taken into account RNAs are grouped into families by sequence similarity Multiple sequence alignment is performed Secondary structure annotation is added Combined secondary structure and primary sequence profiles are captured by profile stochastic context-free grammars (statistical models) These models are stored in the database and new sequences are searched against them

Advancing Science with DNA Sequence Basic principles: finding CDSs using evidence-based vs ab initio algorithms Two major approaches to prediction of protein-coding genes: “evidence-based” (ORFs with translations homologous to the known proteins are CDSs) Advantages: finds “unusual” genes (e. g. horizontally transferred); relatively low rate of false positive predictions Limitations: cannot find “unique” genes; low sensitivity towards short genes; prone to propagation of false positive results of ab initio annotation tools ab initio (ORFs with nucleotide composition similar to CDSs are also CDSs) Advantages: finds “unique” genes; high sensitivity Limitations: often misses “unusual” genes; high rate of false positives

Advancing Science with DNA Sequence Basic principles: finding protein- coding genes with ab initio methods  An ORF that is likely to be protein- coding is found by searching for “coding potential” “Coding potential” is defined by comparing nucleotide sequence of an ORF to a hidden Markov model (HMM) HMM is generated using a training set from the genome or from average frequencies observed for multiple genomes Probability that an ORF is a protein-coding gene is computed  N-terminal (5’) boundary is found by finding a start codon (ATG, GTG, TTG) next to a ribosomal binding site (RBS, Shine-Dalgarno sequence) Different genomes have different frequencies of start codons RBS is found by (Gibbs sampling) multiple sequence alignment of upstream sequences and represented by a weighted positional frequency matrix Or RBS is found by multiple sequence alignment and represented as one of the states in an HMM model Or... Example: the overall HMM architecture used in EasyGene (from Larsen & Krogh, BMC Bioinformatics, 2003).

Advancing Science with DNA Sequence Features and differences between gene finding tools Training set selection (evidence-based vs purely ab initio) Example: CRITICA and EasyGene use evidence-based training sets (BLASTn with counting synonymous/non-synonymous codons in CRITICA, BLASTx in EasyGene); Glimmer uses ab initio training set of long non-overlapping ORFs; GeneMark uses ab initio heuristic model Statistical model of coding and non-coding regions (codon frequencies, dicodon frequencies, hidden Markov models) Example: CRITICA uses dicodon frequencies to model coding regions; Glimmer uses interpolated Markov models (IMM) of up to 5-th order; GeneMark uses order 2 hmm for coding regions, order 0 hmm for non-coding regions Statistical model architecture (i. e. which parts of the CDS are explicitly modeled – may include RBS, spacer region, start codon, second codon, internal codons, stop codon, etc.) Example: EasyGene explicitely models RBS, spacer region, start codon, second codon, internal codons, stop codon, codons surrounding stop codon, non-coding sequence; all other tools have less comprehensive architectures of HMM Additional algorithms for refinement of predictions (RBS finder, overlap resolution, etc.) Example: Glimmer2.0 has a scoring schema for overlap resolution; Glimmer3. uses a dynamic programming algorithm to select the highest-scoring set of predictions consistent with the maximum allowed overlap

Advancing Science with DNA Sequence 1.Introduction (who said annotating prokaryotic genomes is easy?) 2.Tools out there (don’t bother to write down the names and links, all presentations will be available on the web site) 3.Basic principles behind tools (very basic, see specific papers for details) 4.Known problems of the tools: why you may need manual curation (more on manual curation in the next talk by Thanos Lykidis) Outline

Advancing Science with DNA Sequence Known problems: RNAs GenomeSequencing center 16S rRNA, nt Synechococcus sp. CC9311UCSD, TIGR1477 Synechococcus sp. CC9605JGI1440 Synechococcus elongatus PCC 7942 JGI1490 Synechococcus sp. JA-2- 3BA(2-13) TIGR1323 Synechococcus sp. JA-3-3AbTIGR1324 Synechococcus sp. RCC307Genoscope1498 Synechococcus sp. WH7803Genoscope1497, 1464  Large rRNAs: approximate position is correct, boundaries are inaccurate variation of rRNA sizes in closely related strains most 16S rRNA are missing anti- Shine-Dalgarno sequence  Small structural RNAs: covariance models are generally accurate, but may miss some tRNAs in Archaea Check for the full complement of tRNAs with all necessary anti- codons No model for pyrrolysine tRNA  Small regulatory RNAs: search is accurate but slow (too many models) Annotations of regulatory RNAs are missing from many genomes

Advancing Science with DNA Sequence Known problems: CDSs  Short CDSs: many are missed, others are overpredicted short ribosomal proteins (30-40 aa long) are often missed short proteins in the promoter region are often overpredicted  N-terminal sequences are often inaccurate (many features of the sequence around start codon are not accounted for) Glimmer2.0 is calling genes longer than they should be GeneMark, Glimmer3.0 err both ways, but mostly call genes shorter  Pseudogenes all tools are looking for ORFs (needs valid start and stop codons) “unique” genes are often predicted on the opposite strand of a pseudogene  Proteins with unusual translational features (recoding, programmed frameshifts) these genes are often mistaken for pseudogenes see pseudogenes

Advancing Science with DNA Sequence Known problems: different gene finding tools applied to the same genome total featurestotal CDSs non-pseudo CDSspseudototal rRNAtotal tRNA total misc RNA manual71247042669934318631 GeneMark70596974 018622 ORNL70766994 018631 RAST55035422 018630 REGANOR64206339 018630 Glimmer38218 0000 missed by automated annotation false positive (deleted by manual curation) too short (extended by manual curation) too long (truncated by manual curation) total modifications #% CDSs# # # # GeneMark3474.92824.084612.01061.5158122.4 ORNL6108.65607.93004.2115216.3262237.2 RAST178325.31672.3831.1222831.6426160.5 REGANOR90412.82032.82072.9105915.0237333.6 Glimmer32373.3140819.95007.15427.6268738.1

Advancing Science with DNA Sequence Conclusions There are plenty of tools for automated annotation of microbial genome, including several “full-service” servers and annotation pipelines Even “full-service” pipelines identify a limited range of features and development of automated or semi-automated tools for identification of operons, promoters, terminators etc. is highly desirable, but not likely in the absence of experimental data Nearly all of the annotation tools and servers are using different strategies, algorithms, models, settings, etc., so the results may and will vary Different automated gene finders have different advantages and limitations; the best strategy is using any of them or a combination followed by evidence-based manual curation

Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop September 16, 2008.

Similar presentations

Presentation on theme: "Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop September 16, 2008."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop September 16, 2008.

Similar presentations

Presentation on theme: "Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop September 16, 2008."— Presentation transcript:

Similar presentations

About project

Feedback