Canadian Bioinformatics Workshops www.bioinformatics.ca.



Module 4 Metagenomic Functional Composition

Learning Objectives of Module
  – Determine the difference between functional composition and taxonomic composition
  – Have a general understanding of different functional databases
  – Understand the pros and cons of assembling and gene calling with metagenomic data
  – Be able to functionally annotate your metagenomic sample using HUMAnN
  – Be able to determine statistically significant differences in functional abundance using STAMP

Functional Composition
Taxonomic composition answers “Who is there?”
Functional composition answers “What are they doing?”
Metagenomics provides the opportunity to catalog the set of genes from an entire community

What do we mean by function?
General categories
  – Photosynthesis
  – Nitrogen metabolism
  – Glycolysis
Specific groups of orthologs
  – nifH
  – EC: (alcohol dehydrogenase)
  – K00929 (butyrate kinase)

Various Functional Databases
COG
  – Well known, but the original classification has not been updated since 2003
SEED
  – Used by the RAST and MG-RAST systems
Pfam
  – Focused more on protein domains
eggNOG
  – Very comprehensive (~190k groups)
UniRef
  – Has clustering at different levels (e.g. UniRef100, UniRef90, UniRef50)
  – Most comprehensive and is constantly updated
KEGG
  – Very popular, each entry is well annotated, and often linked into “Modules” or “Pathways”
  – Full access now requires a license fee
MetaCyc
  – Is starting to replace KEGG
  – More microbe-focused than KEGG

KEGG
We will focus on using the KEGG database during this workshop
KEGG Orthologs (KOs)
  – Most specific. Thought to be homologs that perform exactly the same “function”
  – ~12,000 KOs in the database
  – These can be linked into KEGG Modules and KEGG Pathways
  – Identifiers: K01803, K00231, etc.

KEGG (cont.)
KEGG Modules
  – Manually defined functional units
  – Small groups of KOs that function together
  – ~750 KEGG Modules
  – Identifiers: M00002, M00011, etc.

KEGG (cont.)
KEGG Pathways
  – Groups KOs into large pathways (~230)
  – Each pathway has a graphical map
  – Individual KOs or Modules can be highlighted within these maps
  – Pathways can be collapsed into very general functional terms (e.g. Amino Acid Metabolism, Carbohydrate Metabolism, etc.)
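The KO, Module and Pathway levels described above form a simple hierarchy, and abundances measured at the KO level can be summed upward. The following is a minimal sketch of that idea. The identifiers follow the KEGG naming pattern, but the links between them, the pathway names map_A and map_B, and the roll_up helper are all assumptions made for illustration, not KEGG data or part of any tool.

```python
from collections import defaultdict

# Illustrative links only: the identifiers follow the KEGG naming pattern,
# but these KO -> Module -> Pathway assignments are assumptions, not KEGG data.
KO_TO_MODULES = {"K01803": ["M00002"], "K00231": ["M00011"]}
MODULE_TO_PATHWAYS = {"M00002": ["map_A"], "M00011": ["map_A", "map_B"]}


def roll_up(abundance, child_to_parents):
    """Sum child abundances into every parent they are linked to."""
    totals = defaultdict(float)
    for child, value in abundance.items():
        for parent in child_to_parents.get(child, []):
            totals[parent] += value
    return dict(totals)


ko_abundance = {"K01803": 3.0, "K00231": 2.0}
module_abundance = roll_up(ko_abundance, KO_TO_MODULES)
pathway_abundance = roll_up(module_abundance, MODULE_TO_PATHWAYS)
```

Because one child can belong to several parents, this naive sum counts the same KO in more than one pathway, which is part of the reason later steps try to reduce spurious pathways.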

Metagenomic Annotation Systems
Web-based (all of these options provide functional and taxonomic analysis, plus host your data)
  – EBI Metagenomics Server
  – MG-RAST
  – IMG/M
GUI-based
  – MEGAN: allows connection between taxonomy and function
  – CloVR: virtual machine based, contains SOP, hasn’t been updated recently
Local
  – MetAMOS: built-in assembly, highly customizable, some features can be buggy
  – DIY: set up your own in-house custom computational pipeline
  – HUMAnN

HUMAnN

HUMAnN Step 1
Reads are searched against a protein database (e.g. KEGG)
  – This is done separately from actually running HUMAnN
  – Can use BLASTX, but much faster methods are now available (e.g. DIAMOND)
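Search tools such as BLASTX and DIAMOND can write their hits as a 12-column tab-separated table (query, subject, percent identity, alignment length, mismatches, gap opens, query start and end, subject start and end, e-value, bit score). As a hedged illustration of what the search output looks like before it is handed to the next step, here is a small parser for that layout. The function name parse_hits, the e-value cutoff and the example read and gene identifiers are assumptions for this sketch and are not taken from HUMAnN.

```python
import csv


def parse_hits(tabular_lines, max_evalue=1e-5):
    """Parse 12-column tabular search output into (read, subject, evalue, bitscore).

    The e-value cutoff is an illustrative default, not a recommended setting.
    """
    hits = []
    for row in csv.reader(tabular_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        read_id, subject_id = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue <= max_evalue:
            hits.append((read_id, subject_id, evalue, bitscore))
    return hits


# Hypothetical example lines in the 12-column layout.
example = [
    "read1\tgeneA\t95.0\t33\t1\t0\t1\t99\t10\t42\t1e-12\t70.1",
    "read1\tgeneB\t60.0\t30\t12\t0\t1\t90\t5\t34\t0.5\t25.0",
    "read2\tgeneA\t90.0\t33\t3\t0\t1\t99\t50\t82\t1e-9\t61.2",
]
hits = parse_hits(example)
```

Keeping every hit that passes the cutoff, rather than only the best hit per read, leaves room for the weighting described in the next step.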


HUMAnN Step 2
Normalize and weight search results
The relative abundance of each KO is calculated:
  – Number of reads mapping to a gene sequence in that KO
  – Weighted by the inverse p-value of each mapping
  – Normalized by the average length of the KO
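The three bullets above can be read as a small calculation: sum a weight per hit, divide by the average gene length of the KO, then rescale so the values sum to one. The sketch below follows that reading. The exact weighting function is not given on the slide, so the default used here (1 - p, so that more significant hits count more) is an assumption, as are the function name and the input layout. This is not HUMAnN's actual code.

```python
from collections import defaultdict


def ko_relative_abundance(hits, gene_to_ko, ko_avg_length, weight=lambda p: 1.0 - p):
    """Estimate relative KO abundance from (read_id, gene_id, p_value) hits.

    `weight` turns a p-value into a hit weight; 1 - p is an assumed choice.
    """
    totals = defaultdict(float)
    for _read_id, gene_id, p_value in hits:
        ko = gene_to_ko.get(gene_id)
        if ko is None:
            continue  # gene is not assigned to any KO
        totals[ko] += weight(p_value)
    # Longer genes attract more reads, so divide by average KO length.
    scaled = {ko: total / ko_avg_length[ko] for ko, total in totals.items()}
    norm = sum(scaled.values())
    return {ko: value / norm for ko, value in scaled.items()} if norm else {}


# Hypothetical inputs: two reads hit a 1000 bp KO, one read hits a 500 bp KO.
example_hits = [("r1", "g1", 0.0), ("r2", "g1", 0.0), ("r3", "g2", 0.0)]
abundance = ko_relative_abundance(
    example_hits,
    gene_to_ko={"g1": "KO_A", "g2": "KO_B"},
    ko_avg_length={"KO_A": 1000, "KO_B": 500},
)
```

In this example the two KOs end up with equal relative abundance, because KO_A has twice the reads but is also twice as long.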


HUMAnN Step 3
Reduce number of pathways
A KO can map to one or more KEGG Pathways
  – Just because a KO is found in a pathway doesn’t mean that the pathway exists in the community
  – If a pathway has 20 KOs and only 2 KOs are observed in the community (but at high abundances), what should the abundance of the pathway be?
  – MinPath (Ye, 2009) attempts to estimate the abundance of these pathways and remove spurious noise
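The underlying idea of explaining the observed KOs with as few pathways as possible can be illustrated with a greedy set-cover heuristic. This is a deliberately simplified stand-in and not the MinPath algorithm itself; the function name and the toy pathway definitions are assumptions for illustration.

```python
def minimal_pathways(observed_kos, pathway_to_kos):
    """Greedily pick a small set of pathways that covers the observed KOs.

    `pathway_to_kos` maps a pathway name to the set of KOs it contains.
    """
    # Only KOs that belong to at least one pathway can ever be covered.
    in_any_pathway = set().union(*pathway_to_kos.values()) if pathway_to_kos else set()
    uncovered = set(observed_kos) & in_any_pathway
    chosen = []
    while uncovered:
        # Pick the pathway explaining the most still-unexplained KOs;
        # sorting the names makes tie-breaking deterministic.
        best = max(sorted(pathway_to_kos), key=lambda p: len(uncovered & pathway_to_kos[p]))
        chosen.append(best)
        uncovered -= pathway_to_kos[best]
    return chosen


# Toy example: P1 alone explains every observed KO, so P2 and P3 are dropped.
toy_pathways = {
    "P1": {"K1", "K2", "K3", "K4"},
    "P2": {"K1"},
    "P3": {"K3", "K9"},
}
kept = minimal_pathways({"K1", "K2", "K3"}, toy_pathways)
```

A greedy choice is not guaranteed to find the smallest possible set, but it shows why a pathway supported only by KOs that another pathway already explains can be treated as spurious.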


HUMAnN Step 4
Reduce false positive pathways further and normalize by KO copy number
Using the organism information from the KEGG hits:
  – Pathways that are not found to be in any of the observed organisms AND are made up mostly of KOs mapping to a different pathway are removed
  – KO abundance can be divided by the estimated copy number of that KO as observed from the KEGG organism database


HUMAnN Step 5
Smoothing pathways by gap filling
  – Insufficient sequencing depth or poor sequence searches could lead to some KOs within pathways being absent or in low abundance
  – KOs with abundances more than 1.5 interquartile ranges below the pathway median are raised to the pathway median
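The gap-filling rule above is a small outlier correction and can be sketched in a few lines. The version below uses Python's statistics module for the median and quartiles. The quartile convention, the function name and the example abundances are assumptions, so results may differ slightly from what HUMAnN itself computes.

```python
import statistics


def smooth_pathway(ko_abundances):
    """Raise unusually low KO abundances in one pathway to the pathway median.

    A KO counts as a gap when it lies more than 1.5 interquartile ranges
    below the median of all KO abundances in the pathway.
    """
    values = list(ko_abundances.values())
    if len(values) < 2:
        return dict(ko_abundances)  # quartiles need at least two values
    median = statistics.median(values)
    q1, _q2, q3 = statistics.quantiles(values, n=4)
    floor = median - 1.5 * (q3 - q1)
    return {ko: (median if value < floor else value) for ko, value in ko_abundances.items()}


# Hypothetical pathway in which K8 was missed by the search.
pathway = {"K1": 10.0, "K2": 10.0, "K3": 11.0, "K4": 11.0,
           "K5": 12.0, "K6": 12.0, "K7": 13.0, "K8": 0.0}
smoothed = smooth_pathway(pathway)
```

In this example only K8 falls below the cutoff and is raised to the median; all other KOs keep their original values. With very few KOs in a pathway, a single gap can inflate the interquartile range enough that nothing is filled.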


What about assembly?
Assembly is often used in genomics to join raw reads into longer contigs and scaffolds

Assembly for Metagenomics?
Pros
  – Less computation time for annotation
  – Can allow annotation when reads are too short (<100 bp)
  – Can sometimes partially reconstruct genomes
Cons
  – Reads are not all from the same genome, so chimeras can be formed
  – Read depth is often not as deep as in genomics, which can make assembly fail
  – High organism diversity can cause assembly to fail (subsampling may help)
  – If calculating the abundance of genes, reads collapsed by assembly must be added back in post-annotation (MetAMOS does a good job of this)
  – Can bias results, since some organisms/genes assemble more easily and will be falsely over-represented

What about gene calling?
In genomics, you would normally predict the start and stop positions of genes using a gene prediction program before annotating the genes
In metagenomics:
  – Pros
      May result in fewer false positives from annotating “non-real” genes
      Lowers the number of annotation comparisons later on
  – Cons
      No good training dataset
      Raw reads will not cover an entire gene
      Often requires assembled data
  – Possible tools: FragGeneScan, MetaGeneAnnotator
  – Alternative: do a 6-frame translation (e.g. BLASTX)
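To make the 6-frame alternative concrete, here is a minimal sketch of translating a read in all six reading frames with the standard genetic code, which is conceptually what a translated search does before comparing against proteins. The function names are illustrative, and codons containing ambiguous bases are written as X by assumption.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frame_translation(seq):
    """Return translations for frames +1..+3 (forward) and -1..-3 (reverse strand)."""
    frames = {}
    for strand, template in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(template[offset:])
    return frames


frames = six_frame_translation("ATGGCC")
```

Stop codons appear as * in the output. A short read will usually cover only part of a gene, so most frames contain fragments rather than complete proteins.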

Community Function Potential
It is important to remember that this is metagenomics, not metatranscriptomics and not metaproteomics
These annotations suggest the functional potential of the community
The presence of these genes/functions does not mean that they are biologically active (e.g. they may not be transcribed)

Microbiome Helper
  – Provides scripts that help automate and combine different tools together into a bioinformatics workflow
  – Provides up-to-date and step-by-step documentation for processing 16S and metagenomic data
  – Well tested and flexible enough to accommodate newly emerging tools

[Figure: workflow diagram. Labels: 16S rRNA gene, QIIME, Shotgun Metagenomics, MetaPhlAn, HUMAnN, PICRUSt, STAMP. Shown with example OTU-by-sample and KO-by-sample abundance tables (Sample 1, Sample 2, Sample 3).]

IMR: Integrated Microbiome Resource
  – Offers sequencing and bioinformatics for microbiome projects

Microbiome Amplicon Sequencing Workflow (16S / 18S amplicons on the Illumina MiSeq)
CGEB-IMR.ca, Dalhousie U, March 2015
Workflow steps
  – DNA extraction
  – 16S (V6-V8) or 18S (V4) PCR
  – Gel verification
  – PCR clean-up & library normalization
  – Illumina MiSeq sequencing
Notes
  – Method/kit appropriate to specific samples (e.g. stool, urine, etc.)
  – Invitrogen E-gel 96-well high-throughput method
  – Invitrogen SequalPrep 96-well high-throughput method
  – Paired-end reads
  – ~25 M reads = ~15 Gb
  – ~65 k reads/sample (for 384)
  – Duplicate with template dilutions
  – Multiplexing to 384 samples/run
  – Only 1 PCR w/fusion primers
Fusion primer amplicon labels: P5 adapter, i5 index, F primer, 16S/18S sequence, R primer, i7 index, P7 adapter
QC = quality-control check/step
Times listed: 0.5 d, 1 d, 1 h, 1.5 h, ~3 d; total time approx. 5 d

Questions?

We are on a Coffee Break & Networking Session