Gene Prediction Preliminary Results Computational Genomics February 20, 2012.

1 Gene Prediction Preliminary Results Computational Genomics February 20, 2012

2 Ab initio Gene Prediction Using Glimmer3, RAST, Prodigal and GeneMarkS

3 Prodigal
- Low model complexity: no Hidden Markov Model and no Interpolated Markov Model.
- Based on dynamic programming.
- Remains accurate in high-GC-content genomes.
- Tends to predict longer genes rather than more genes.

4 Prodigal Protocol


6 Prodigal Options

7 Build Training File

8 Running Prodigal

9 Screenshot of Results

10 GeneMarkS Gene prediction in prokaryotic genomes with unsupervised model parameter estimation


12 Web based version

13 Command line version
Syntax: runGeneMarkS
The output folder contains 3 types of files:
- .out file: contains the default output
- .faa file: contains the amino acid sequences of the corresponding ORFs in FASTA format
- .fnn file: contains the nucleotide sequences of the corresponding ORFs in FASTA format
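The gene counts and gene lengths compared later in the deck can be read straight from a .fnn file. The following is a minimal sketch, assuming the file is ordinary multi-line FASTA; the function name `fasta_lengths` and the two toy records are hypothetical and stand in for real GeneMarkS output.

```python
from io import StringIO

def fasta_lengths(handle):
    """Return {header: sequence length} for a FASTA stream."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; the header is everything after '>'.
            name = line[1:]
            lengths[name] = 0
        elif name is not None:
            # Sequence lines may wrap, so lengths are accumulated.
            lengths[name] += len(line)
    return lengths

# Hypothetical two-record example standing in for a real .fnn file.
example = StringIO(">gene_1\nATGAAATAA\n>gene_2\nATGCCC\nGGGTAA\n")
lengths = fasta_lengths(example)
gene_count = len(lengths)
mean_length = sum(lengths.values()) / gene_count
```

For a real run, `open("some_output.fnn")` (file name assumed) would replace the `StringIO` object.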

14 Screenshot of the .out file
Strand: + is the normal strand, - is the reverse strand
Left end: begin position; Right end: end position

15 Screenshot of the .faa file

16 Screenshot of the .fnn file

17 Glimmer3 A system for finding genes in microbial DNA. It works by creating a variable-length Markov model from a training set of genes, then using the model to identify all genes in a DNA sequence.

18 Running Glimmer3
A two-step process:
1. A probability model of coding sequences, called an interpolated context model, must be built from a set of training sequences. The training set can come from:
   - genes identified by homology, or known genes
   - long, non-overlapping ORFs
   - genes from a highly similar species
2. The program is run to analyze the sequences and make gene predictions.
Best results require the longest possible training set of genes.

19 Glimmer3 programs
- long-orfs: uses an amino-acid distribution model to filter the set of ORFs
- extract: builds the training set from long, non-overlapping ORFs
- build-icm: builds the interpolated context model from the training sequences
- glimmer3: analyzes sequences and makes predictions

20 Interpolated Context Model
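The idea behind an interpolated model can be illustrated with a much simpler relative: an interpolated Markov model that mixes estimates from contexts of several lengths. The sketch below is not Glimmer3's algorithm; the fixed weights, the add-one smoothing, the toy training sequence and all function names are assumptions made only for illustration.

```python
import math
from collections import defaultdict

BASES = "ACGT"

def train_counts(sequences, max_order):
    """counts[k][context][base] = times `base` followed the k-base `context`."""
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(max_order + 1)]
    for seq in sequences:
        for i, base in enumerate(seq):
            for k in range(min(i, max_order) + 1):
                counts[k][seq[i - k:i]][base] += 1
    return counts

def interpolated_prob(counts, context, base, weights):
    """Mix the order-0..order-K estimates of P(base | context) using fixed weights."""
    # Only orders that fit inside the available context are used.
    usable = [(k, w) for k, w in enumerate(weights) if k <= len(context)]
    norm = sum(w for _, w in usable)
    prob = 0.0
    for k, w in usable:
        seen = counts[k][context[len(context) - k:]]
        # Add-one smoothing keeps unseen contexts from giving zero probability.
        p_k = (seen.get(base, 0) + 1) / (sum(seen.values()) + len(BASES))
        prob += (w / norm) * p_k
    return prob

def log_score(counts, seq, weights):
    """Sum of log-probabilities of each base given up to K preceding bases."""
    max_order = len(weights) - 1
    return sum(
        math.log(interpolated_prob(counts, seq[max(0, i - max_order):i], base, weights))
        for i, base in enumerate(seq)
    )

training = ["ATGATGATGATG"]   # toy training set (assumed)
weights = (0.2, 0.3, 0.5)     # assumed fixed weights for orders 0, 1, 2
counts = train_counts(training, max_order=2)
coding_like = log_score(counts, "ATGATG", weights)
other = log_score(counts, "CCCCCC", weights)
```

A sequence that resembles the training data (`coding_like`) scores higher than one that does not (`other`), which is the property a gene finder relies on when ranking candidate ORFs.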



23 RAST RAST (Rapid Annotation using Subsystem Technology) is a system for annotating bacterial and archaeal genomes. Its pipeline uses tRNAscan-SE and Glimmer2, and compares candidates against prokaryotic genes that are universal across species.


25 Number of Genes Predicted
ID        Glimmer3  Prodigal  RAST  GeneMark
M19107    1728                1784  1808
M19501    1914      1867      2015  1933
M21127    2370      2317      2456  2413
M21621    1937      1914      1838  1972
M21639    2698      2665      2823  2797
M21709    1924      1881      2004  1925
Average   2095      2062      2153  2141
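The averages in the table can be rechecked with a few lines of arithmetic. This sketch uses the per-strain counts as read from the slide; Prodigal is left out because its M19107 value is not legible in the transcript. The variable names are illustrative.

```python
# Per-strain gene counts in the order M19107, M19501, M21127, M21621, M21639, M21709.
gene_counts = {
    "Glimmer3": [1728, 1914, 2370, 1937, 2698, 1924],
    "RAST": [1784, 2015, 2456, 1838, 2823, 2004],
    "GeneMark": [1808, 1933, 2413, 1972, 2797, 1925],
}

# Mean number of predicted genes per tool, rounded to whole genes.
averages = {tool: round(sum(v) / len(v)) for tool, v in gene_counts.items()}

# Spread between the tools for each strain (largest minus smallest count).
spread = [max(col) - min(col) for col in zip(*gene_counts.values())]
```

The rounded means agree with the Average row for these three tools.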

26 Gene Length of Predicted Genes
ID        Glimmer3  RAST    GeneMark
M19107    791.43    793.56  801.50
M19501    806.71    809.12  840.52
M21127    987.09    692.20  708.70
M21621    851.47    900.93  885.61
M21639    740.28    751.85  762.46
M21709    840.49    843.18  873.15
Average   836.25    798.47  811.99

27 Homology-based Gene Prediction using BLAT


29 Homology-based Gene Prediction using BLAT
- Query: 1709 protein-coding genes from Haemophilus influenzae
- Targets: Haemophilus haemolyticus assemblies M19107.fasta (99 contigs), M19501.fasta (17), M21127.fasta (29), M21621.fasta (24), M21639.fasta (49), M21709.fasta (31)
- BLAT (UCSC) produces the .pslx output
- Query coverage (%) frequency graphs are used to define a cutoff
- Hits passing the cutoff give the predicted genes

30 [Figure: frequency of query coverage (%), with the cut-off marked]

31 Homology-based Gene Prediction using BLAT: Results
Strain    Contigs  Query-coverage cutoff (%)  Predicted genes  Average length
M19107    99       90                         787              1049
M19501    17       90                         1063             996
M21127    29       90                         901              963
M21621    24       90                         930              685
M21639    49       90                         970              1277
M21709*   31       90                         1515             813
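The 90% cutoff can be applied to BLAT output with a short filter. This is a sketch under stated assumptions: query coverage is taken to be the aligned span of the query divided by the query length, the input uses the standard tab-separated PSL column order (which .pslx extends with sequence columns), and `make_line` plus the gene names are hypothetical helpers for building toy input.

```python
def query_coverage(psl_line):
    """Return (query name, percent of the query spanned by the alignment)."""
    fields = psl_line.rstrip("\n").split("\t")
    q_name = fields[9]
    q_size = int(fields[10])
    q_start = int(fields[11])
    q_end = int(fields[12])
    return q_name, 100.0 * (q_end - q_start) / q_size

def genes_above_cutoff(lines, cutoff=90.0):
    """Keep the best coverage per query, for queries reaching the cutoff."""
    best = {}
    for line in lines:
        # Header lines of a PSL file do not start with a digit.
        if not line.strip() or not line[0].isdigit():
            continue
        name, cov = query_coverage(line)
        if cov >= cutoff and cov > best.get(name, 0.0):
            best[name] = cov
    return best

def make_line(q_name, q_size, q_start, q_end):
    """Hypothetical helper: a 21-column PSL row with placeholder values."""
    cols = ["0"] * 21
    cols[8] = "+"
    cols[9], cols[10], cols[11], cols[12] = q_name, str(q_size), str(q_start), str(q_end)
    return "\t".join(cols)

toy_hits = [
    "psLayout version 3",              # header line, skipped
    make_line("geneA", 1000, 0, 950),  # 95% coverage, kept
    make_line("geneB", 900, 100, 500), # about 44% coverage, dropped
]
predicted = genes_above_cutoff(toy_hits)
```

Counting the keys of `predicted` for each strain would give a figure comparable to the Predicted genes column, provided the same coverage definition was used.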

32 Gene Calling Protocol
- N° of predicted genes (≥ 90% query coverage): M19107 787, M19501 1063, M21127 901, M21621 930, M21639 970, M21709* 1515
- Gene scoring system (presence / absence): ≥ 4/5, = 3/5, ≤ 2/5
- Multiple alignment (MUSCLE)
- Consensus sequence
- Final set of homology-based predicted genes?

33 RNA Prediction

34 First-pass filters (tRNAscan and EufindtRNA) identify "candidate" tRNA regions of the sequence. Further analysis with Cove confirms the initial tRNA predictions.

35 tRNAscan-SE -B -o -f -m
-B : search for bacterial tRNAs. This option selects the bacterial covariance model for tRNA analysis and loosens the search parameters for EufindtRNA to improve detection of bacterial tRNAs.
-o : save the final results to the file named after the option.
-f : save the results and tRNA secondary structures to the file named after the option.
-m : save a statistics summary for the run. It contains the run options selected as well as statistics on the number of tRNAs detected at each phase of the search, search speed, and other statistics.
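When the same options are run over six assemblies, it helps to build the command in one place. The sketch below only assembles the argument list from the options described on this slide; the function name and every file name are hypothetical, and the input FASTA is assumed to come last on the command line.

```python
def trnascan_command(fasta, results, structures, stats):
    """Build a tRNAscan-SE argument list using the -B, -o, -f and -m options."""
    return [
        "tRNAscan-SE",
        "-B",              # bacterial search mode
        "-o", results,     # final results
        "-f", structures,  # results plus secondary structures
        "-m", stats,       # run statistics summary
        fasta,             # input sequence file (assumed to be the last argument)
    ]

# Hypothetical file names for one strain.
cmd = trnascan_command("M19107.fasta", "M19107.out", "M19107.ss", "M19107.stats")
```

The list can be handed to `subprocess.run(cmd, check=True)` on a machine where tRNAscan-SE is installed.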

36 Output using “–o” parameter Output using “–f” parameter


38 Output using “-m” parameter
                             M19107  M19501  M21127  M21621  M21639  M21709
No. of contigs               99      17      29      23      49      29
Contigs with at least 1 tRNA 45      12      22      19      33      21
First-pass tRNAs predicted   103     124     114     123     137     113
Cove-confirmed tRNAs (only five values survive, strain assignment unclear): 41, 51, 50, 52, 51


40 RNAmmer

41 Working RNAmmer uses two levels of hidden Markov models. The spotter model is constructed from highly conserved loci within a structural alignment of known rRNA sequences. Once the spotter model detects the approximate position of a gene, the flanking regions are extracted and passed to the full model, which matches the entire gene. The two-level approach avoids running the full model across an entire genome sequence, which allows faster predictions.

42 Command line options
rnammer -S (species) -m (molecules) -xml (xml file) -gff (gff file) -h (hmm report file) -f (fasta file)
-S : specifies the species to use. In our case, it will be bacterial.
-m : molecules to search for (i.e. large subunit or small subunit).

43 ##gff-version2
##source-version RNAmmer-1.2
##date 2012-02-19
##Type DNA
# seqname  source       feature  start   end     score   +/-  frame  attribute
# ---------------------------------------------------------------------------
84         RNAmmer-1.2  rRNA     28110   31006   3556.4  +    .      23s_rRNA
84         RNAmmer-1.2  rRNA     31127   31241   82.9    +    .      5s_rRNA
1          RNAmmer-1.2  rRNA     116969  117083  82.9    -    .      5s_rRNA
60         RNAmmer-1.2  rRNA     338     452     82.9    +    .      5s_rRNA
29         RNAmmer-1.2  rRNA     198     312     82.9    +    .      5s_rRNA
84         RNAmmer-1.2  rRNA     25977   27507   1872.9  +    .      16s_rRNA
# ---------------------------------------------------------------------------
Results (three counts per strain; the column headers did not survive in the transcript)
M19107  4  1  1
M19501  7  1  1
M21127  4  1  0
M21621  4  0  0
M21639  7  2  1
M21709  8  2  2
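Per-strain rRNA counts like those in the results can be tallied directly from the GFF output. A minimal sketch, assuming the file is tab-separated GFF version 2 with the rRNA type in the ninth column; the records are the six shown on this slide and the function name is illustrative.

```python
from collections import Counter

def count_rrna_types(lines):
    """Count GFF records per rRNA type (ninth column), skipping comments."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        counts[fields[8].strip()] += 1
    return counts

# The six records from the slide, rebuilt as tab-separated lines.
records = [
    ("84", "28110", "31006", "3556.4", "+", "23s_rRNA"),
    ("84", "31127", "31241", "82.9", "+", "5s_rRNA"),
    ("1", "116969", "117083", "82.9", "-", "5s_rRNA"),
    ("60", "338", "452", "82.9", "+", "5s_rRNA"),
    ("29", "198", "312", "82.9", "+", "5s_rRNA"),
    ("84", "25977", "27507", "1872.9", "+", "16s_rRNA"),
]
gff_lines = ["##gff-version2"] + [
    "\t".join([seq, "RNAmmer-1.2", "rRNA", start, end, score, strand, ".", kind])
    for seq, start, end, score, strand, kind in records
]
rrna_counts = count_rrna_types(gff_lines)
```

For this example the tally is four 5S, one 16S and one 23S record.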

44 sRNA Prediction

45 Rfam Database Homology Search
Rfam is a collection of RNA families:
- Non-coding RNA genes
- Structured cis-regulatory elements
- Self-splicing RNAs
The search uses WU-BLAST and keeps hits with E-value < 1e-5.

46 Rfam Preliminary Results
The output format is:
84 Rfam similarity 25970 27512 1477.28 + . evalue=2.08e-50;gc-content=52;id=SSU_rRNA_bacteria.1;model_end=1518;model_start=1;rfam-acc=RF00177;rfam-id=SSU_rRNA_bacteria
Results:
Accession #  Total ncRNA  # of rRNA  # of tRNA / tmRNA  # of sRNA  Others (RNase P)  Sequencing coverage
M19107       65           10         43                 11         1                 12 X
M19501       85           14         53                 17         1                 53 X
M21127       79           9          52                 17         1                 20 X
M21621       81           10         54                 16         1                 25 X
M21639       95           12         53                 29         1                 78 X
M21709       92           16         54                 21         1                 34 X
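The attribute column of an Rfam hit packs several key=value pairs separated by semicolons, so a small parser makes per-family counts easy. This sketch assumes tab-separated columns and uses the example hit from this slide; the function name is illustrative.

```python
def parse_attributes(field):
    """Turn 'key=value;key=value' into a dictionary of strings."""
    return dict(item.split("=", 1) for item in field.split(";") if item)

# The example hit from the slide, rebuilt as one tab-separated line.
hit = "\t".join([
    "84", "Rfam", "similarity", "25970", "27512", "1477.28", "+", ".",
    "evalue=2.08e-50;gc-content=52;id=SSU_rRNA_bacteria.1;model_end=1518;"
    "model_start=1;rfam-acc=RF00177;rfam-id=SSU_rRNA_bacteria",
])

columns = hit.split("\t")
attrs = parse_attributes(columns[8])

# Apply the same E-value threshold the search uses.
passes_cutoff = float(attrs["evalue"]) < 1e-5
```

Grouping hits by `attrs["rfam-id"]` is one way to derive per-category totals like those in the results table.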

47 Things to be done
- Get GenePRIMP to work: we are having problems with the installation, and the web server takes a long time to process.
- Get the further information required to run other RNA prediction software.
- Compare specific RNA prediction software with the Rfam predictions.

48 Leading Biocomputational Tools
- eQRNA (Rivas and Eddy 2001)
- RNAz (Washietl et al. 2005; Gruber et al. 2010)
- sRNAPredict3/SIPHT (Livny et al. 2006, 2008)
- NAPP (Marchais et al. 2009)
All four approaches use comparative genomics!!
Lu, X., H. Goodrich-Blair, et al. (2011). "Assessing computational tools for the discovery of small RNA genes in bacteria." RNA 17(9): 1635-1647.

49 sRNAPredict3 Pipeline
