Gene prediction in metagenomic fragments: A large scale machine learning approach Katharina J Hoff, Maike Tech, Thomas Lingner, Rolf Daniel, Burkhard Morgenstern.


Gene prediction in metagenomic fragments: A large scale machine learning approach. Katharina J Hoff, Maike Tech, Thomas Lingner, Rolf Daniel, Burkhard Morgenstern and Peter Meinicke.
Orphelia: predicting genes in metagenomic sequencing reads. Katharina J. Hoff, Thomas Lingner, Peter Meinicke and Maike Tech.
Presented by José Lugo-Martínez, I609, Week 9, March 9, 2010.

Outline
- Introduction
- Background
- Orphelia
- Effect of sequencing errors
- Conclusion

Metagenomics (revisited)
- Simultaneously characterize all single-species genomes of a particular habitat
- Without prior cultivation!
- Phylogenetic origin may be unknown
- Identification of protein-coding genes: the location of genes is unknown!
- Identification of metabolic pathways

So far, we have discussed … [Diagram: Environmental Sample; Sequencing; Who is there? (Phylogenetic profiling); What are they doing? (Functional profiling)] … Today

What’s the Problem?
- The phylogenetic origin of the reads is unknown
  - Most reads cannot be assembled into longer contigs
  - How to assemble reads?
- This implies: analysis of single sequencing reads
- But ORF-based gene finding, which expects complete ORFs, will overlook genes in most reads
- Gene prediction approaches designed for metagenomics are needed
  - Fast and accurate

Possible Approaches (1)
- Homology-based
  - BLAST search against databases of known proteins
  - BLAST search against the sample
  - Clustering of sample and database sequences
- Limited to already known genes, and/or computationally expensive

Possible Approaches (2)
- Model-based methods
  - GeneMark: derives an adapted monocodon usage model from GC-content
  - MetaGene: extracts ORFs and scores them, then calculates the final ORF combination from different scores
  - FragGeneScan: Mina Rho (IU)
  - Orphelia: fragment-oriented, based on a two-stage machine learning approach:
    - linear discriminants and neural networks

Orphelia Overview
[Figure: overview diagram; annotation: "Added in 2nd paper"]

Pipeline
[Flow diagram: Extract “ORFs”; Score all Candidates (Likely Genes vs. Likely Random “ORFs”); Selection of Candidates; Final Prediction]

“ORF” Extraction
- Begin: start codon (ATG, CTG, GTG, or TTG)
- Followed by >18 subsequent triplets
- End: stop codon (TGA, TAG, or TAA)
- But also consider incomplete “ORFs” of length ≥60 bp that lack a start and/or stop codon
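The extraction rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not Orphelia's implementation: the function name `complete_orfs`, the parameter `min_inner_triplets`, and the reading of ">18 subsequent triplets" as "at least 19 triplets between start and stop codon" are assumptions. It scans the forward strand only and leaves out the incomplete “ORFs” that run off the fragment edges.

```python
STARTS = {"ATG", "CTG", "GTG", "TTG"}
STOPS = {"TGA", "TAG", "TAA"}


def complete_orfs(seq, min_inner_triplets=19):
    """Return (start, end) pairs of complete ORFs on the forward strand.

    `end` is exclusive and includes the stop codon. Every in-frame start
    codon upstream of a stop yields its own candidate (assumption: all
    alternative start sites are kept as separate candidates).
    """
    seq = seq.upper()
    found = []
    for frame in range(3):
        starts = []  # start codons seen since the last in-frame stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOPS:
                for s in starts:
                    # number of triplets strictly between start and stop
                    inner = (i - s) // 3 - 1
                    if inner >= min_inner_triplets:
                        found.append((s, i + 3))
                starts = []
            elif codon in STARTS:
                starts.append(i)
    return found


if __name__ == "__main__":
    fragment = "ATG" + "GCA" * 19 + "TAA"
    print(complete_orfs(fragment))
```

A full extractor would repeat the scan on the reverse complement and add the incomplete candidates of at least 60 bp described on the slide.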

“ORFs” Identification
[Figure labels: STOP, TIS]

Scoring of Candidate

Step 1: Linear Discriminants
- Feature preprocessing
  - Training linear discriminants
- Example: monocodon linear discriminant
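As a rough illustration of the monocodon example, the sketch below turns an “ORF” into a 64-dimensional codon frequency vector and scores it with a linear discriminant (a weighted sum). The names `monocodon_frequencies` and `discriminant_score` are hypothetical, and the weights would have to come from training on annotated genomes; the training step itself is not shown because the slide does not specify it.

```python
from itertools import product

# All 64 codons in a fixed order, so each feature has a stable position.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
INDEX = {codon: i for i, codon in enumerate(CODONS)}


def monocodon_frequencies(orf):
    """Relative frequency of each codon, read in frame from position 0."""
    counts = [0] * 64
    total = 0
    orf = orf.upper()
    for i in range(0, len(orf) - 2, 3):
        idx = INDEX.get(orf[i:i + 3])  # codons with ambiguous bases are skipped
        if idx is not None:
            counts[idx] += 1
            total += 1
    return [c / total for c in counts] if total else [0.0] * 64


def discriminant_score(weights, features, bias=0.0):
    """Linear discriminant: weighted sum of the features plus a bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias


if __name__ == "__main__":
    x = monocodon_frequencies("ATGGCAGCATAA")
    print(x[INDEX["GCA"]])                   # frequency of codon GCA
    print(discriminant_score([1.0] * 64, x))  # placeholder weights
```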

Step 2: Neural Network
- Input: Feature vector x :=
- Output: probability that the “ORF” is protein-coding (gene probability)
- Training of Orphelia versions
  - Net300 (a.k.a. Orphelia 300)
    - Trained on 300 bp fragments, for predicting genes in 454 reads
  - Net700 (a.k.a. Orphelia 700)
    - Trained on 700 bp fragments, for predicting genes in Sanger reads
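To make the second stage concrete, here is a minimal forward pass through a network with one hidden layer that maps a feature vector to a value between 0 and 1. The architecture (tanh hidden units, logistic output), the function name `gene_probability`, and all weights are assumptions for illustration; the slide does not give the composition of x or the network layout.

```python
import math


def gene_probability(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass: tanh hidden layer followed by a logistic output unit."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(hidden_weights, hidden_biases)
    ]
    z = sum(w * h for w, h in zip(output_weights, hidden)) + output_bias
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    # Placeholder feature vector and weights, not trained values.
    x = [0.2, -0.1, 0.7]
    hidden_weights = [[0.5, -0.3, 0.8], [-0.2, 0.1, 0.4]]
    hidden_biases = [0.0, 0.1]
    print(gene_probability(x, hidden_weights, hidden_biases, [1.2, -0.7], 0.05))
```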

Gene Candidate Selection Algorithm
Initially:
  C_i = all “ORFs” of fragment i with gene probability p > 0.5, along with their gene probability
  G_i = ∅ (empty)
Selection Algorithm(C_i, G_i):
  while C_i is not empty
    1. determine the “ORF” with the highest probability among all “ORFs” in C_i
    2. remove the selected “ORF” from C_i and add it to G_i
    3. remove all “ORFs” from C_i that overlap the selected “ORF” by more than o_max bp
Result: G_i = list of genes for fragment i
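The greedy selection can be written out directly. In this sketch a candidate is a hypothetical `(start, end, probability)` tuple with an exclusive end coordinate, and the default `o_max=60` is a placeholder, since the slide does not state the value used.

```python
def overlap(a, b):
    """Number of positions shared by two (start, end, ...) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def select_genes(candidates, o_max=60, p_min=0.5):
    """Greedy selection of non-conflicting gene candidates for one fragment."""
    # C_i: only candidates with gene probability above the threshold
    remaining = [c for c in candidates if c[2] > p_min]
    selected = []  # G_i
    while remaining:
        # 1. candidate with the highest gene probability
        best = max(remaining, key=lambda c: c[2])
        # 2. move it from C_i to G_i
        selected.append(best)
        # 3. drop it and every candidate overlapping it by more than o_max bp
        remaining = [
            c for c in remaining
            if c is not best and overlap(c, best) <= o_max
        ]
    return selected


if __name__ == "__main__":
    orfs = [(0, 300, 0.9), (50, 350, 0.8), (280, 600, 0.7), (100, 200, 0.4)]
    print(select_genes(orfs))
```

Because every iteration removes at least the selected candidate, the loop always terminates, and no two selected genes overlap by more than `o_max` bp.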

Performance
- Use fragments with known annotation
- Compare prediction to annotation
- Measures: sensitivity and specificity
  - TP: reading frame and/or stop codon of prediction match annotation
  - FP: predicted gene does not occur in annotation
  - FN: annotated gene was not predicted
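The formulas for the two measures are not preserved in this transcript. Assuming the definitions customary in gene prediction, sensitivity = TP / (TP + FN) and specificity = TP / (TP + FP), they can be computed from the counts as follows; the function names and the example counts are illustrative only.

```python
def sensitivity(tp, fn):
    """Fraction of annotated genes that were predicted (assumed definition)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tp, fp):
    """Fraction of predicted genes that match the annotation (assumed definition)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


if __name__ == "__main__":
    # Made-up counts, purely to show the arithmetic.
    print(sensitivity(tp=80, fn=20))
    print(specificity(tp=80, fp=10))
```

Note that this "specificity" is what other fields call precision; it does not involve true negatives.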

Test Species
- Randomly excised fragments to 1x genome coverage from annotated genomes
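One simple way to produce such test fragments is to draw fixed-length substrings at random positions until their total length reaches the genome length. The sketch below assumes uniform start positions and a fixed fragment length; the function name `excise_fragments` and its parameters are hypothetical, and the original study may have sampled differently.

```python
import random


def excise_fragments(genome, fragment_length, coverage=1.0, seed=None):
    """Return (start, fragment) pairs whose total length reaches coverage x genome."""
    if fragment_length <= 0 or fragment_length > len(genome):
        raise ValueError("fragment_length must be between 1 and the genome length")
    rng = random.Random(seed)
    target = coverage * len(genome)
    fragments = []
    total = 0
    while total < target:
        # uniform start position such that the fragment fits inside the genome
        start = rng.randint(0, len(genome) - fragment_length)
        fragments.append((start, genome[start:start + fragment_length]))
        total += fragment_length
    return fragments


if __name__ == "__main__":
    toy_genome = "ACGT" * 250  # 1000 bp placeholder sequence
    for start, frag in excise_fragments(toy_genome, 300, seed=1):
        print(start, len(frag))
```

Keeping the start coordinate with each fragment makes it possible to look up the overlapping annotated genes later when predictions are compared to the annotation.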

Results

Sensitivity on different fragment lengths

Specificity on different fragment lengths

Web server:

Limitations
- Do not annotate rRNA and tRNA genes
- All model-based methods are susceptible to sequencing errors

Effect of sequencing errors
- Traditional gene prediction methods have been benchmarked on real sequencing reads with typical errors
- However, such a comparison has not been conducted for specialized tools
  - Gene prediction accuracy has mostly been measured on error-free DNA fragments

Two major sequencing techniques
- Sanger sequencing
  - Average read length of ~700 nt
  - Error rates from 0.001% to > 1% (depends on software used for post-processing of reads)
- Pyrosequencing
  - Shorter reads, ~450 nt
  - Error rate of 0.49% for reads of nt
  - Metagenomics simulation software MetaSim produces reads with an error rate of 2.8%

Accuracy Results

Accuracy Results by GC-content

Conclusions
- Orphelia
  - High gene prediction accuracy on short DNA fragments
  - High gene prediction specificity
- Accounting for realistic sequencing error rates will significantly influence prediction performance

Additional References
- K. J. Hoff, T. Lingner, P. Meinicke, M. Tech, “Orphelia: predicting genes in metagenomic sequencing reads”, Nucleic Acids Research, 2009, 37, W101–W105.
- K. J. Hoff, “The effect of sequencing errors on metagenomic gene prediction”, BMC Genomics, 2009, 10:520.
- H. Noguchi, J. Park, T. Takagi, “MetaGene: prokaryotic gene finding from environmental genome shotgun sequences”, Nucleic Acids Research, 2006, 34(19), 5623–5630.
- J. Besemer, M. Borodovsky, “Heuristic approach to deriving models for gene finding”, Nucleic Acids Research, 1999, 27(19),

Questions?