Canadian Bioinformatics Workshops www.bioinformatics.ca.


1 Canadian Bioinformatics Workshops www.bioinformatics.ca


3 Module 4 Metagenomic Functional Composition

4 Module 4 bioinformatics.ca Learning Objectives of Module
Determine the difference between functional composition and taxonomic composition
Have a general understanding of different functional databases
Understand the pros and cons of assembling and gene calling with metagenomic data
Be able to functionally annotate your metagenomic sample using HUMAnN
Be able to determine statistically significant differences in functional abundance using STAMP

5 Module 4 bioinformatics.ca Functional Composition
Taxonomic composition answers “Who is there?”
Functional composition answers “What are they doing?”
Metagenomics provides the opportunity to catalog the set of genes from an entire community

6 Module 4 bioinformatics.ca What do we mean by function?
General categories
– Photosynthesis
– Nitrogen metabolism
– Glycolysis
Specific groups of orthologs
– NifH
– EC: 1.1.1.1 (alcohol dehydrogenase)
– K00929 (butyrate kinase)

7 Module 4 bioinformatics.ca Various Functional Databases
COG
– Well known but original classification not updated since 2003
SEED
– Used by the RAST and MG-RAST systems
PFAM
– Focused more on protein domains
EggNOG
– Very comprehensive (~190k groups)
UniRef
– Has clustering at different levels (e.g. UniRef100, UniRef90, UniRef50)
– Most comprehensive and is constantly updated
KEGG
– Very popular, each entry is well annotated, and often linked into “Modules” or “Pathways”
– Full access now requires a license fee
MetaCyc
– Is starting to replace KEGG
– More microbe focused than KEGG

8 Module 4 bioinformatics.ca KEGG
We will focus on using the KEGG database during this workshop
KEGG Orthologs (KOs)
– Most specific. Thought to be homologs that perform exactly the same “function”
– ~12,000 KOs in the database
– These can be linked into KEGG Modules and KEGG Pathways
– Identifiers: K01803, K00231, etc.

9 Module 4 bioinformatics.ca KEGG (cont.)
KEGG Modules
– Manually defined functional units
– Small groups of KOs that function together
– ~750 KEGG Modules
– Identifiers: M00002, M00011, etc.

10 Module 4 bioinformatics.ca KEGG (cont.)
KEGG Pathways
– Groups KOs into large pathways (~230)
– Each pathway has a graphical map
– Individual KOs or Modules can be highlighted within these maps
– Pathways can be collapsed into very general functional terms (e.g. Amino Acid Metabolism, Carbohydrate Metabolism, etc.)

11 Module 4 bioinformatics.ca Metagenomic Annotation Systems
Web-based (all of these options provide functional and taxonomic analysis, plus host your data)
– EBI Metagenomics Server
– MG-RAST
– IMG/M
GUI-based
– MEGAN: allows connection between taxonomy and function
– CloVR: virtual machine based, contains SOP, hasn’t been updated recently
Local-based
– MetAMOS: built-in assembly, highly customizable, some features can be buggy
– DIY: set up your own in-house custom computational pipeline
– HUMAnN

12 Module 4 bioinformatics.ca Humann

13 Module 4 bioinformatics.ca HUMAnN Step 1
Reads are searched against a protein database (e.g. KEGG)
– This is done separately from the actual running of HUMAnN
– Can use BLASTX, but much faster methods are now available (e.g. DIAMOND)
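
As a rough illustration of what the search step produces, the sketch below keeps the best hit per read from tabular search output. It assumes the common 12-column BLAST-style tabular layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score); the slide does not state which output format is used. The function name, read IDs and gene IDs are hypothetical.

```python
import csv
import io

def best_hit_per_read(tabular_text):
    """Keep the highest-bitscore hit for each read from 12-column tabular output."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if len(row) < 12:
            continue  # skip blank or malformed lines
        read_id, subject_id = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if read_id not in best or bitscore > best[read_id]["bitscore"]:
            best[read_id] = {"subject": subject_id, "evalue": evalue, "bitscore": bitscore}
    return best

# Hypothetical hits: read IDs and gene IDs are made up for illustration.
example = "\n".join([
    "\t".join(["read1", "eco:b0001", "95.0", "33", "1", "0", "1", "99", "10", "42", "1e-12", "70.5"]),
    "\t".join(["read1", "sty:STM0002", "80.0", "33", "6", "0", "1", "99", "10", "42", "1e-05", "45.1"]),
    "\t".join(["read2", "bsu:BSU00010", "88.0", "30", "3", "0", "4", "93", "5", "34", "1e-08", "55.0"]),
])
hits = best_hit_per_read(example)
print(hits["read1"]["subject"])
```

Keeping only one hit per read is a simplification made here for brevity; a real pipeline may retain several hits per read and weight them.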

14 Module 4 bioinformatics.ca Humann

15 Module 4 bioinformatics.ca HUMAnN Step 2
Normalize and weight search results
The relative abundance of each KO is calculated:
– Number of reads mapping to a gene sequence in that KO
– Weighted by the inverse p-value of each mapping
– Normalized by the average length of the KO
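
The three ingredients of this calculation can be sketched in a few lines. This is a minimal illustration, not HUMAnN's own code: the slide does not say how the "inverse p-value" weight is computed, so the weight is a caller-supplied function and the default of 1 - p is purely an assumption. The function name and the example numbers are hypothetical.

```python
from collections import defaultdict

def ko_relative_abundance(hits, ko_avg_length, weight=lambda p: 1.0 - p):
    """hits: iterable of (ko_id, p_value), one entry per read-to-gene mapping.
    ko_avg_length: dict of ko_id -> average gene length for that KO.
    weight: ASSUMPTION, converts a p-value into a weight (default 1 - p)."""
    totals = defaultdict(float)
    for ko, p_value in hits:
        # Confident mappings (small p-value) count for more than marginal ones.
        totals[ko] += weight(p_value)
    # Longer genes collect more reads by chance, so divide by average length.
    per_length = {ko: w / ko_avg_length[ko] for ko, w in totals.items()}
    grand_total = sum(per_length.values())
    # Scale so that the abundances of all KOs sum to 1.
    return {ko: v / grand_total for ko, v in per_length.items()}

# Two reads hit K00001 (average length 1000), one read hits K00002 (average length 500).
hits = [("K00001", 0.0), ("K00001", 0.0), ("K00002", 0.0)]
lengths = {"K00001": 1000.0, "K00002": 500.0}
abundance = ko_relative_abundance(hits, lengths)
print(abundance)
```

In this toy example K00001 receives twice as many reads but is twice as long, so after length normalization both KOs end up with the same relative abundance.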

16 Module 4 bioinformatics.ca Humann

17 Module 4 bioinformatics.ca HUMAnN Step 3
Reduce number of pathways
A KO can map to one or more KEGG Pathways
– Just because a KO is found in a pathway doesn’t mean that the pathway exists in the community
– If a pathway has 20 KOs and only 2 KOs are observed in the community (but at high abundances) what should be the abundance of the pathway?
– MinPath (Ye, 2009) attempts to estimate the abundance of these pathways and remove spurious noise
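
To give a feel for the parsimony idea behind this step, the sketch below picks a small set of pathways that together explain every observed KO. It uses a simple greedy set-cover heuristic, which is a swapped-in technique chosen for brevity and is not MinPath's actual algorithm. The function name, pathway names and KO sets are hypothetical.

```python
def minimal_pathways_greedy(observed_kos, pathway_to_kos):
    """Greedily choose pathways until every observed KO that belongs to
    at least one pathway is explained. pathway_to_kos: dict of name -> set of KOs."""
    all_pathway_kos = set().union(*pathway_to_kos.values())
    uncovered = set(observed_kos) & all_pathway_kos
    chosen = []
    while uncovered:
        # Take the pathway explaining the most still-unexplained KOs;
        # sorting the names makes tie-breaking deterministic.
        best = max(sorted(pathway_to_kos), key=lambda p: len(pathway_to_kos[p] & uncovered))
        chosen.append(best)
        uncovered -= pathway_to_kos[best]
    return chosen

# Hypothetical pathways: K00003 belongs to both A and B.
pathways = {
    "pathway_A": {"K00001", "K00002", "K00003"},
    "pathway_B": {"K00003", "K00009"},
    "pathway_C": {"K00004", "K00005"},
}
observed = {"K00001", "K00002", "K00003"}
kept = minimal_pathways_greedy(observed, pathways)
print(kept)
```

Here pathway_B is dropped because its only observed KO is already explained by pathway_A, which is the kind of spurious pathway call this step is meant to remove.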

18 Module 4 bioinformatics.ca Humann

19 Module 4 bioinformatics.ca HUMAnN Step 4
Reduce false positive pathways further and normalize by KO copy number
Using the organism information from the KEGG hits
– Pathways that are not found to be in any of the observed organisms AND are made up mostly of KOs mapping to a different pathway are removed
– KO abundance can be divided by the estimated copy number of that KO as observed from the KEGG organism database
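
The copy-number part of this step is a simple division, sketched below. The function name and the numbers are hypothetical, and treating a KO with no copy-number estimate as single-copy is an assumption of the sketch rather than something the slide states.

```python
def normalize_by_copy_number(ko_abundance, ko_copy_number):
    """Divide each KO abundance by its estimated copy number.
    ASSUMPTION: KOs without an estimate are treated as single-copy."""
    return {ko: value / ko_copy_number.get(ko, 1.0) for ko, value in ko_abundance.items()}

abundance = {"K00001": 0.6, "K00002": 0.4}
copies = {"K00001": 3.0}  # hypothetical: K00001 is typically present in 3 copies
normalized = normalize_by_copy_number(abundance, copies)
print(normalized)
```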

20 Module 4 bioinformatics.ca Humann

21 Module 4 bioinformatics.ca HUMAnN Step 5
Smoothing pathways by gap filling
– Sequencing depth or poor sequence searches could lead to some KOs within pathways being absent or in low abundance
– KOs with abundance more than 1.5 interquartile ranges below the pathway median are raised to the pathway median
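
The gap-filling rule can be sketched for a single pathway as follows. This is a minimal illustration under stated assumptions: quartiles are taken from Python's `statistics.quantiles` with its default method, which may differ from the quartile definition HUMAnN uses, and the function name and abundances are hypothetical.

```python
import statistics

def smooth_pathway(ko_abundances):
    """Raise unusually low KO abundances within one pathway to the pathway median.
    ko_abundances: dict of ko_id -> abundance (needs at least two KOs)."""
    values = list(ko_abundances.values())
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    # Anything more than 1.5 interquartile ranges below the median is treated as a gap.
    threshold = median - 1.5 * (q3 - q1)
    return {ko: (median if v < threshold else v) for ko, v in ko_abundances.items()}

# Hypothetical pathway in which one KO was barely detected.
pathway = {
    "K00001": 8, "K00002": 9, "K00003": 10, "K00004": 10,
    "K00005": 11, "K00006": 12, "K00007": 0.5,
}
smoothed = smooth_pathway(pathway)
print(smoothed["K00007"])
```

In this example only K00007 falls below the threshold and is raised to the median; the other KOs are left as they are.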

22 Module 4 bioinformatics.ca Humann

23 Module 4 bioinformatics.ca What about assembly? Assembly is often used in genomics to join raw reads into longer contigs and scaffolds

24 Module 4 bioinformatics.ca Assembly for Metagenomics?
Pros
– Less computation time for annotation
– Can allow annotation when reads are too short (<100bp)
– Can sometimes partially reconstruct genomes
Cons
– Reads are not all from the same genome so chimeras can be formed
– Read depth is often not as deep as in genomics, which makes assembly fail
– High organism diversity can cause assembly to fail (subsampling may help)
– If calculating abundance of genes then reads collapsed by assembly must be added back in post-annotation (MetAMOS does a good job of this)
– Can bias results since some organisms/genes will assemble more easily, which will result in those features being falsely over-represented

25 Module 4 bioinformatics.ca What about gene calling?
In genomics, normally you would predict the start and stop positions of genes using a gene prediction program before annotating the genes
In metagenomics:
– Pros:
   May result in fewer false positives from annotating “non-real” genes
   Lowers the number of annotation comparisons later on
– Cons:
   No good learning dataset
   Raw reads will not cover an entire gene
   Often requires assembled data
– Possible tools: FragGeneScan, MetaGeneAnnotator
– Alternative: Do 6-frame translation (e.g. BLASTX)
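
The six-frame alternative can be illustrated with a short sketch that translates a read in all three offsets on both strands. It uses the standard genetic code and marks any codon containing an ambiguous base as X. The function names are hypothetical, and this only shows the translation idea, not how BLASTX is implemented.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(seq):
    # Trailing bases that do not fill a complete codon are ignored.
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))

def six_frame_translations(seq):
    """Return the translations for frames +1..+3 and -1..-3."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames

frames = six_frame_translations("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein)
```

Each of the six translations would then be searched against the protein database, which is why this approach avoids gene calling at the cost of more comparisons.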

26 Module 4 bioinformatics.ca Community Function Potential
It is important to remember that this is metagenomics, not metatranscriptomics and not metaproteomics
These annotations suggest the functional potential of the community
The presence of these genes/functions does not mean that they are biologically active (e.g. they may not be transcribed)

27 Module 4 bioinformatics.ca Microbiome Helper
– Provides scripts that help automate and combine different tools into a bioinformatics workflow
– Provides up-to-date, step-by-step documentation for processing 16S and metagenomic data
– Well tested and flexible enough to accommodate newly emerging tools
– https://github.com/mlangill/microbiome_helper

28 Module 4 bioinformatics.ca
Diagram labels: 16S rRNA gene, QIIME, Shotgun Metagenomics, HUMAnN, MetaPhlAn, PICRUSt, STAMP

        Sample 1  Sample 2  Sample 3
OTU 1   4         0         2
OTU 2   1         0         0
OTU 3   2         4         2

        Sample 1  Sample 2  Sample 3
K00001  20        15        18
K00002  1         2         0
K00003  4         5         4

29 Module 4 bioinformatics.ca IMR: Integrated Microbiome Resource
– Offers sequencing and bioinformatics for microbiome projects
– http://cgeb-imr.ca

30 Microbiome Amplicon Sequencing Workflow (16S / 18S amplicons on the Illumina MiSeq) CGEB-IMR.ca DalhousieU March 2015
Steps
– DNA extraction
– 16S (V6-V8) or 18S (V4) PCR
– Gel verification
– PCR clean-up & library normalization
– Illumina MiSeq sequencing
Notes
– Method/kit appropriate to specific samples (e.g. stool, urine, etc.)
– Invitrogen E-gel 96-well high-throughput method
– Invitrogen SequalPrep 96-well high-throughput method
– 300+300 bp paired-end reads
– ~25 M reads = ~15 Gb
– ~65 k reads/sample (for 384)
– Duplicate with template dilutions
– Multiplexing to 384 samples/run
– Only 1 PCR w/fusion primers
– QC: quality-control check/step
Amplicon construct labels: P5 adapter, i5 index, F primer, 16S/18S sequence, R primer, i7 index, P7 adapter
Times as listed on the slide: 0.5 d, 1 d, 1 h, 1.5 h, ~3 d; total time = 5 d approx.
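
The sequencing yield figures on this slide follow from simple arithmetic, shown below as a quick check. It assumes that "~25 M reads" counts read pairs, each contributing 300 + 300 bp, and that reads are split evenly across 384 samples; the variable names are hypothetical.

```python
total_read_pairs = 25_000_000   # "~25 M reads", ASSUMED to mean read pairs
bp_per_pair = 300 + 300         # 300+300 bp paired-end reads
samples_per_run = 384           # multiplexing to 384 samples/run

total_gb = total_read_pairs * bp_per_pair / 1e9
reads_per_sample = total_read_pairs / samples_per_run

print(f"Total yield: ~{total_gb:.0f} Gb")
print(f"Reads per sample: ~{reads_per_sample / 1000:.0f} k")
```

In practice the per-sample depth varies because samples never pool perfectly evenly, so the per-sample figure is best read as an average.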

31 Module 4 bioinformatics.ca

32 Module 4 bioinformatics.ca Questions?

33 Module bioinformatics.ca We are on a Coffee Break & Networking Session

