GASiC: Metagenomic abundance estimation and diagnostic testing on species level Martin Lindner, Bernhard Renard NG 4, Robert Koch-Institut.

Presentation transcript:


Contents
Motivation
– What is Metagenomics?
– Focus: Abundance Estimation
GASiC Method
– Mapping
– Genome Similarity Estimation
– Similarity Correction
Comparison, Application
Technical Details
– Current Status
– GASiC and SeqAn

What is Metagenomics?
Analysis of genomic material directly taken from environmental samples.
[Images: purified Escherichia coli (Rocky Mountain Laboratories, NIAID, NIH) vs. Lake Washington microbes (Dennis Kunkel Microscopy, Inc.)]
+ Identify contributors of special functions
+ Study interaction of microbes
+ Estimate microbial diversity
- Highly complex samples
- Mostly unknown organisms
- High spatial/temporal variability

Metagenomic Communities
[Figure: example communities along a scale from low to high complexity (number of microbial species): bioreactor, acid mine drainage, hydrothermal vents, Lake Lanier (USA), human microbiome, famous polar bear, soil, marine sediments]

Bioinformatics in Metagenomics
Genome assembly
Gene/function prediction
Taxonomic profiling
Interaction networks
→ Focus on taxonomic profiling: Who is out there? And how many?

Taxonomic Profiling
Reference based: high accuracy, narrow focus
Composition based: low accuracy, broad focus
Applications: diversity estimation, exploration & assembly, comparative metagenomics, abundance estimation, clinical applications

Genome Abundance Estimation
Goal: Estimate relative abundance of organisms from metagenomic sequence reads
Problems:
– (Reference genome unknown)
– Unequal genome lengths, e.g. Buchnera aphidicola: 0.64 Mbp vs. Streptomyces bingchenggensis: 11.9 Mbp
– Genomic similarity
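To illustrate why unequal genome lengths matter, here is a minimal Python sketch. It assumes reads are sampled uniformly along each genome, so the expected number of reads is proportional to genome length times the number of genome copies. The function name and the read counts are hypothetical; only the two genome lengths come from the slide. This illustrates the problem and is not GASiC's implementation.

```python
def copies_from_reads(read_counts, genome_lengths_bp):
    """Convert read counts to relative genome copy abundances.

    Assumption: reads are sampled uniformly, so reads from a genome are
    proportional to (genome length) x (number of genome copies).
    """
    # Reads per base pair is proportional to the number of genome copies.
    per_base = {name: read_counts[name] / genome_lengths_bp[name]
                for name in read_counts}
    total = sum(per_base.values())
    return {name: value / total for name, value in per_base.items()}


# Genome lengths are from the slide; the read counts are made up.
lengths = {"Buchnera aphidicola": 0.64e6,
           "Streptomyces bingchenggensis": 11.9e6}
reads = {"Buchnera aphidicola": 640,
         "Streptomyces bingchenggensis": 11900}

relative_copies = copies_from_reads(reads, lengths)
```

In this made-up example the longer genome receives about 18.6 times as many reads, yet both organisms come out at equal copy abundance once length is accounted for.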

GASiC Method

1. Read Mapping
Choose a suitable read mapper
Map reads against reference genomes
– Each genome separately
– Does it match? Yes/No
Write results to SAM files
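The yes/no decision per read can be taken from the SAM output. The sketch below is one possible way to count mapped reads, assuming single-end reads and relying on the SAM FLAG bit 0x4 ("segment unmapped"). The function name and the example records are hypothetical and not taken from GASiC.

```python
def count_mapped_reads(sam_lines):
    """Count distinct read names with at least one mapped alignment."""
    mapped = set()
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        # FLAG bit 0x4 set means the read is unmapped.
        if not flag & 0x4:
            mapped.add(name)
    return len(mapped)


# Hypothetical SAM records: read1 and read3 are mapped, read2 is not.
example_sam = [
    "@HD\tVN:1.0\tSO:unsorted",
    "@SQ\tSN:genomeA\tLN:1000",
    "read1\t0\tgenomeA\t10\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII",
    "read3\t16\tgenomeA\t50\t255\t4M\t*\t0\t0\tGGCA\tIIII",
]
n_mapped = count_mapped_reads(example_sam)
```

Counting distinct read names rather than lines avoids counting a read twice if the mapper reports several alignments for it.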

2. Similarity Estimation
Similarity matrix: $A = (a_{ij})$, with row index $i$ and column index $j$
$a_{ij}$ = probability that a read from genome $i$ can be mapped to genome $j$
How to obtain $a_{ij}$:
– Simulate $N$ reads from genome $i$ (e.g. with Mason)
– Map the reads to genome $j$ with the same mapper/settings as in step 1 (Read Mapping)
– Count the number of mapped reads $r_{ij}$
– $a_{ij} = r_{ij} / r_{ii}$
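The normalization in the last step can be sketched with numpy. The function name and the count values are hypothetical; the only thing taken from the slide is the rule that each entry is the count of simulated reads from genome i mapped to genome j, divided by the count mapped back to genome i itself.

```python
import numpy as np


def similarity_matrix(mapped_counts):
    """Build A from counts r[i][j] of simulated reads from genome i
    that were mapped to genome j, using a_ij = r_ij / r_ii."""
    counts = np.asarray(mapped_counts, dtype=float)
    diagonal = np.diag(counts)
    if np.any(diagonal == 0):
        raise ValueError("no simulated read mapped back to its own genome")
    # Divide every row i by its diagonal entry r_ii.
    return counts / diagonal[:, np.newaxis]


# Made-up counts for two genomes.
counts = [[9800, 2450],
          [1900, 9500]]
A = similarity_matrix(counts)
```

Dividing by the diagonal rather than by N means that reads the mapper fails to place even on their source genome do not lower the similarity values.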

3. Similarity Correction Linear Model: Matrix notation: Linear Algebra lecture:
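One way to read the linear model is sketched below. It assumes the observed number of reads mapped to genome j equals the sum over all genomes i of a_ij times the true read count c_i, which follows from the definition of a_ij in step 2. In matrix form that is observed = Aᵀc, and the textbook answer is to solve this linear system for c. The names `true_c` and `observed` and all numbers are illustrative assumptions.

```python
import numpy as np

# Similarity matrix: a_ij = probability that a read from genome i
# can be mapped to genome j (made-up values).
A = np.array([[1.0, 0.25],
              [0.2, 1.0]])

# Hypothetical true read counts per genome.
true_c = np.array([600.0, 300.0])

# Forward model: observed_j = sum_i a_ij * c_i, i.e. observed = A^T c.
observed = A.T @ true_c

# Plain linear algebra: solve A^T c = observed for c.
estimated_c = np.linalg.solve(A.T, observed)
```

With noisy counts or nearly identical genomes, a direct solve like this can return negative or unstable values, which is the motivation for a constrained formulation.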

Non-negative LASSO
Non-negative LASSO [Renard et al.]
Approximate solution: solve with a standard solver for constrained optimization
GASiC: COBYLA from the scipy package
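A constrained least-squares fit with COBYLA can be sketched as follows. This is a minimal illustration under stated assumptions, not GASiC's actual code. Abundances are expressed as fractions of all reads, every abundance must be non-negative, and the abundances may sum to at most 1 (for non-negative values this sum is the L1 norm, so the bound plays the role of the LASSO constraint). The function name, the bound of 1, the solver options, and the data are assumptions.

```python
import numpy as np
from scipy.optimize import minimize


def correct_abundances(A, observed_fraction):
    """Minimize ||A^T c - r||^2 subject to c_i >= 0 and sum(c) <= 1."""
    A = np.asarray(A, dtype=float)
    r = np.asarray(observed_fraction, dtype=float)
    n = len(r)

    def objective(c):
        residual = A.T @ c - r
        return float(residual @ residual)

    # COBYLA accepts inequality constraints of the form fun(c) >= 0.
    constraints = [{"type": "ineq", "fun": lambda c, i=i: c[i]}
                   for i in range(n)]
    constraints.append({"type": "ineq", "fun": lambda c: 1.0 - np.sum(c)})

    start = np.full(n, 1.0 / n)
    result = minimize(objective, start, method="COBYLA",
                      constraints=constraints, tol=1e-8,
                      options={"rhobeg": 0.1, "maxiter": 5000})
    return result.x


# Made-up similarity matrix and observed fractions of mapped reads,
# generated from true abundances of 0.6 and 0.3.
A = [[1.0, 0.25],
     [0.2, 1.0]]
observed_fraction = [0.66, 0.45]
estimate = correct_abundances(A, observed_fraction)
```

COBYLA is derivative-free, so the result is approximate; in this small example it should land close to the true abundances of 0.6 and 0.3.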

Comparison
Metagenomic FAMeS dataset [Mavromatis et al.]:
113 microbial species
3 datasets with different complexities
100,000 Sanger reads (1000 bp) per dataset
Ground truth available
Comparison by Xia et al.

Application
Viral recombination data [Moore et al.]:
– 4 viruses with 80%-96% sequence similarity
– Abundance estimates from biological experiments

Technical Details
Language: Python
– Use scipy/numpy packages
Platform: Linux (native)
Interfaces (command line) to:
– Read simulator (e.g. Mason [Holtgrewe])
– Read mapper (e.g. bowtie [Langmead et al.])

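Technical Details
[Workflow diagram with three stages: Mapping (reads and genomes, mapper, SAM files), Similarity Estimation (simulator, simulated reads, mapper, SAM files, similarity matrix), and Similarity Correction (abundance estimates). The connections between modules are labeled as disk reads and writes.]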

GASiC & SeqAn
Avoid disk IO!
Integrate all modules in one tool
Abandon dependencies on external tools
→ SeqAn looks like a suitable framework!

Example: Similarity Matrix
Current implementation:
1. Simulate 100,000 reads and write to fastq file
2. Read file and map to ref. genome, write results to SAM file
3. Read SAM file and count the number of matching reads
The SeqAn way:
1. Simulate 1 read and map to ref. genomes; count if read mapped
2. Repeat 100,000 times
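The streaming idea can be illustrated in Python with toy stand-ins. The sketch below simulates one read at a time, tests it against every reference in memory, and updates the counters, so no fastq or SAM file is written. The simulator (error-free substrings) and the mapper (exact substring match) are deliberately simplistic assumptions; they are not the Mason, bowtie, or SeqAn interfaces, and all names and sequences are hypothetical.

```python
import random


def simulate_read(genome, read_length, rng):
    """Toy simulator: an error-free substring at a random position."""
    start = rng.randrange(len(genome) - read_length + 1)
    return genome[start:start + read_length]


def read_maps(read, genome):
    """Toy mapper: a read maps only if it occurs exactly in the genome."""
    return read in genome


def similarity_row(source, references, n_reads, read_length, seed=0):
    """Count, per reference, how many reads simulated from `source` map."""
    rng = random.Random(seed)
    counts = [0] * len(references)
    for _ in range(n_reads):
        # One read at a time: simulate, map, count, then discard.
        read = simulate_read(source, read_length, rng)
        for j, reference in enumerate(references):
            if read_maps(read, reference):
                counts[j] += 1
    return counts


genome_a = "ACGTACGTTTGACCAGTAGGCATCGA"
genome_b = "TTTTGGGGCCCCAAAATTTTGGGGCC"
row = similarity_row(genome_a, [genome_a, genome_b],
                     n_reads=1000, read_length=8)
```

Each call produces one row of mapped-read counts; repeating it with every genome as the source gives the full count matrix from which the similarity matrix is derived.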

References
Method:
Lindner,M.S. and Renard,B.Y. (2012) Metagenomic abundance estimation and diagnostic testing on species level. Nucl. Acids Res., doi: /nar/gks803.
Renard,B.Y. et al. (2008) NITPICK: peak identification for mass spectrometry data. BMC Bioinformatics, 9, 355.
Datasets:
Mavromatis,K. et al. (2007) Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat. Methods, 4, 495–500.
Moore,J. et al. (2011) Recombinants between Deformed wing virus and Varroa destructor virus-1 may prevail in Varroa destructor-infested honeybee colonies. J. Gen. Virol., 92, 156–161.
Related Methods:
Huson,D. et al. (2007) MEGAN analysis of metagenomic data. Genome Res., 17, 377–386.
Xia,L. et al. (2011) Accurate genome relative abundance estimation based on shotgun metagenomic reads. PLoS One, 6, e
External Tools:
Langmead,B. et al. (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol., 10, R25.
Holtgrewe,M. (2010) Mason – a read simulator for second generation sequencing data. Technical report TR-B Institut für Mathematik und Informatik, Freie Universität Berlin.

Acknowledgements
Research Group Bioinformatics (NG4): Bernhard Renard, Franziska Zickmann, Martina Fischer, Robert Rentzsch, Anke Penzlin, Mathias Kuhring, Sven Giese