Gene prediction in metagenomic fragments: A large scale machine learning approach Katharina J Hoff, Maike Tech, Thomas Lingner, Rolf Daniel, Burkhard Morgenstern.


1 Gene prediction in metagenomic fragments: A large scale machine learning approach
Katharina J Hoff, Maike Tech, Thomas Lingner, Rolf Daniel, Burkhard Morgenstern and Peter Meinicke
Orphelia: predicting genes in metagenomic sequencing reads
Katharina J. Hoff, Thomas Lingner, Peter Meinicke and Maike Tech
José Lugo-Martínez, I609-Week 9, March 9th, 2010

2 Outline
- Introduction
- Background
- Orphelia
- Effect of sequencing errors
- Conclusion

3 Metagenomics (revisited)
- Simultaneously characterize all single-species genomes of a particular habitat
- Without prior cultivation!!
- Phylogenetic origin may be unknown
- Identification of protein-coding genes
  - Location of genes unknown!!
- Identification of metabolic pathways

4 So far, we have discussed …
[Diagram labels: Environmental Sample; Sequencing; Who is there? (Phylogenetic profiling); What are they doing? (Functional profiling)]
… Today

5 What’s the Problem?
- Don’t know phylogenetic origin
  - Most reads cannot be assembled into longer contigs
  - How to assemble reads?
- This implies: analysis of single sequencing reads
- But ORF-based approaches will overlook most reads
- Need gene prediction approaches for metagenomics
  - Fast and accurate

6 Possible Approaches (1)
- Homology based
  - BLAST search against databases of known proteins
  - BLAST search against sample
  - Clustering of sample and database sequences
- Limited to already known genes, and/or computationally expensive

7 Possible Approaches (2)
- Model-based methods
  - GeneMark: derives an adapted monocodon usage model from GC content
  - MetaGene: extracts ORFs and scores them, then calculates the final ORF combination from different scores
  - FragGeneScan: Mina Rho (IU)
  - Orphelia: fragment-oriented, based on a two-stage machine learning approach
    - Linear discriminants and neural networks

8 Orphelia Overview
[Figure: overview diagram, with one part annotated "Added in 2nd paper"]

9 Pipeline
1. Extract “ORFs”
2. Score all candidates (likely genes vs. likely random “ORFs”)
3. Selection of candidates
4. Final prediction

10 “ORF” Extraction
- Begins with a start codon (ATG, CTG, GTG, or TTG)
- Followed by >18 subsequent triplets
- Ends with a stop codon (TGA, TAG, or TAA)
- But also consider incomplete “ORFs” of length ≥60 bp that lack a start and/or stop codon
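
The extraction rules on this slide can be sketched in a few lines of Python. This is only an illustration, not Orphelia's implementation: it scans the forward strand only, keeps every in-frame start codon as an alternative candidate (an assumption), and does not handle the incomplete “ORFs” that lack a start and/or stop codon. The function name `complete_orfs` is hypothetical.

```python
START_CODONS = {"ATG", "CTG", "GTG", "TTG"}
STOP_CODONS = {"TGA", "TAG", "TAA"}
MIN_INNER_TRIPLETS = 19  # "more than 18" triplets between start and stop


def complete_orfs(seq):
    """Return (start, end, frame) for complete ORFs on the forward strand.

    start is the index of the first base of the start codon; end is the
    index one past the last base of the stop codon (half-open interval).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        starts = []  # in-frame start codons seen since the last stop codon
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if codon in STOP_CODONS:
                for s in starts:
                    inner = (pos - s) // 3 - 1  # triplets between start and stop
                    if inner >= MIN_INNER_TRIPLETS:
                        orfs.append((s, pos + 3, frame))
                starts = []  # a stop codon closes all open candidates
            elif codon in START_CODONS:
                starts.append(pos)
    return orfs
```

A fragment such as `"ATG" + "GCT" * 19 + "TAA"` yields a single candidate covering the whole sequence, while the same construct with only 18 inner triplets yields none.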

11 “ORFs” Identification
[Figure labels: STOP, TIS]

12 Scoring of Candidate

13 Step 1: Linear Discriminants
- Feature preprocessing
  - Training linear discriminants
- Example: monocodon linear discriminant
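
As a rough illustration of the monocodon example, the sketch below computes a 64-dimensional codon frequency vector for an “ORF” and reduces it to one score with a linear discriminant. The slide does not preserve the actual features or training procedure, so the function names, the feature layout, and the weights passed in are all assumptions; in practice the weights would be learned from annotated training data.

```python
from itertools import product

# All 64 codons in a fixed order, so feature vectors are comparable.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def monocodon_frequencies(orf):
    """Relative frequency of each codon in an in-frame sequence."""
    orf = orf.upper()
    counts = dict.fromkeys(CODONS, 0)
    total = 0
    for pos in range(0, len(orf) - 2, 3):
        codon = orf[pos:pos + 3]
        if codon in counts:  # skip triplets with ambiguous bases such as N
            counts[codon] += 1
            total += 1
    return [counts[c] / total if total else 0.0 for c in CODONS]


def discriminant_score(features, weights, bias=0.0):
    """Linear discriminant: weighted sum of the features plus a bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias
```

The resulting score is a single number per “ORF”, which is the kind of compact feature a second-stage classifier can consume.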

14 Step 2: Neural Network
- Input: feature vector x := [definition not preserved]
- Output: probability that the “ORF” is coding
- Training of Orphelia versions
  - Net300 (a.k.a. Orphelia 300)
    - Trained on 300 bp fragments for predicting genes (454 reads)
  - Net700 (a.k.a. Orphelia 700)
    - Trained on 700 bp fragments for predicting genes (Sanger reads)

15 Gene Candidate Selection Algorithm
Initially:
  C_i = all “ORFs” of fragment i with gene probability p > 0.5, along with their gene probabilities
  G_i = ∅ (empty)
Selection Algorithm(C_i, G_i):
  while C_i is not empty:
    1. determine the “ORF” with the highest probability among all “ORFs” in C_i
    2. remove the selected “ORF” from C_i and add it to G_i
    3. remove all “ORFs” from C_i that overlap with the selected “ORF” by more than o_max bp
Result: G_i = list of genes for fragment i
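
The greedy selection on this slide translates almost directly into Python. The sketch below assumes each candidate is a `(start, end, probability)` tuple with a half-open interval; the default `o_max=60` is a placeholder for illustration only, since the slide does not state the value, and the function names are hypothetical.

```python
def overlap(a, b):
    """Number of positions shared by two half-open intervals (start, end, ...)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def select_genes(candidates, o_max=60, p_min=0.5):
    """Greedy selection of gene candidates for one fragment.

    candidates: list of (start, end, probability) tuples.
    o_max: maximum tolerated overlap in bp (placeholder value, an assumption).
    """
    # C_i: only candidates with gene probability above the threshold
    remaining = [c for c in candidates if c[2] > p_min]
    genes = []  # G_i
    while remaining:
        # 1. candidate with the highest gene probability
        best = max(remaining, key=lambda c: c[2])
        # 2. move it from C_i to G_i
        genes.append(best)
        # 3. drop it and everything overlapping it by more than o_max bp
        remaining = [c for c in remaining
                     if c is not best and overlap(c, best) <= o_max]
    return genes
```

For example, with candidates `(0, 300, 0.9)`, `(100, 400, 0.8)` and `(280, 600, 0.7)`, the first is selected, the second is discarded because it overlaps the first by 200 bp, and the third survives because its overlap is only 20 bp.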

16 Performance
- Use fragments with known annotation
- Compare prediction to annotation
- Measures: Sensitivity and Specificity
  - TP: reading frame and/or stop codon of prediction match annotation
  - FP: predicted gene does not occur in annotation
  - FN: annotated gene was not predicted
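
The formulas for the two measures are not preserved in this transcript. A minimal sketch, assuming the convention common in gene finding where sensitivity is TP / (TP + FN) and specificity is TP / (TP + FP) (the latter is called precision in other fields), could look like this:

```python
def sensitivity(tp, fn):
    """Fraction of annotated genes that were predicted: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tp, fp):
    """Fraction of predictions that match the annotation: TP / (TP + FP).

    Assumed gene-finding convention; not the TN-based definition.
    """
    return tp / (tp + fp) if (tp + fp) else 0.0
```

Both functions return 0.0 when the denominator is zero, which is a design choice of this sketch rather than something the slides specify.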

17 Test Species
- Randomly excised fragments to 1x genome coverage from annotated genomes
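
One way to read "1x genome coverage" is that the total length of the excised fragments equals the genome length. The sketch below follows that reading; the function name, the uniform choice of start positions, and the fixed seed are assumptions for illustration, not details taken from the slides.

```python
import random


def excise_fragments(genome, frag_len, coverage=1.0, seed=0):
    """Randomly excise fixed-length fragments up to the given coverage.

    Returns a list of (start, fragment) pairs. The number of fragments is
    chosen so that their total length is about coverage * len(genome).
    """
    if frag_len <= 0 or frag_len > len(genome):
        raise ValueError("frag_len must be between 1 and the genome length")
    rng = random.Random(seed)  # fixed seed so the sampling is reproducible
    n_fragments = round(coverage * len(genome) / frag_len)
    fragments = []
    for _ in range(n_fragments):
        start = rng.randrange(len(genome) - frag_len + 1)
        fragments.append((start, genome[start:start + frag_len]))
    return fragments
```

Keeping the start position with each fragment makes it possible to map predictions back to the genome annotation when counting TP, FP and FN.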

18 Results

19 Sensitivity on different fragment lengths

20 Specificity on different fragment lengths

21 Web server: http://orphelia.gobics.de/

22 Limitations
- Does not annotate rRNA and tRNA genes
- All model-based methods are susceptible to sequencing errors

23 Effect of sequencing errors
- Traditional gene prediction methods have been subject to a benchmark study on real sequencing reads with typical errors
- However, such a comparison has not been conducted for specialized tools
  - Gene prediction accuracy is mostly measured on error-free DNA fragments

24 Two major sequencing techniques
- Sanger sequencing
  - Avg read length of ~700 nt
  - Error rates from 0.001% to >1% (depends on software used for post-processing of reads)
- Pyrosequencing
  - Shorter reads, ~450 nt
  - Error rate of 0.49% for reads of 100-200 nt
  - Metagenomics simulation software MetaSim produces reads with an error rate of 2.8%

25 Accuracy Results

26 Accuracy Results by GC-content

27 Conclusions
- Orphelia
  - High gene prediction accuracy on short DNA fragments
  - High gene prediction specificity
- Accounting for realistic sequencing error rates will significantly influence prediction performance

28 Additional References
- http://orphelia.gobics.de/
- K. J. Hoff, T. Lingner, P. Meinicke, M. Tech, “Orphelia: predicting genes in metagenomic sequencing reads”, Nucleic Acids Research, 2009, 37, W101–W105.
- K. J. Hoff, “The effect of sequencing errors on metagenomic gene prediction”, BMC Genomics, 2009, 10:520.
- H. Noguchi, J. Park, T. Takagi, “MetaGene: prokaryotic gene finding from environmental genome shotgun sequences”, Nucleic Acids Research, 2006, 34(19), 5623–5630.
- J. Besemer, M. Borodovsky, “Heuristic approach to deriving models for gene finding”, Nucleic Acids Research, 1999, 27(19), 3911–3920.

29 Questions?

