Yuzhen Ye School of Informatics Indiana University, Bloomington Computational Approaches to Metagenomic Sequences Analysis.

Yuzhen Ye School of Informatics Indiana University, Bloomington yye@indiana.edu Computational Approaches to Metagenomic Sequences Analysis

Metagenomics
- The term “metagenomics” was first used in 1998 (Handelsman et al. Chemistry & Biology 1998, 5:R245-R249)
- A methodology that applies genome sequencing to the culture-independent analysis of complex and diverse (“meta”) environmental populations of microbes
- Metagenomics projects: Global Ocean Sampling (GOS), Acid Mine Drainage (AMD), Human Microbiome Project, etc.
- Getting broader! Functional metagenomics

The Acid Mine Drainage (AMD) project
- An acid mine drainage site: biofilms growing on the surface of flowing AMD in the five-way region of the Richmond mine at Iron Mountain, California; sampled in 2000
- Acid is produced by oxidation of sulfide minerals that are exposed to air as a result of mining activity
Ref: Tyson G et al. Nature 2004, 428:37–43

DNA Sequencing in AMD
- A small-insert plasmid library (average insert size 3.2 kb)
- Shotgun sequencing resulted in 72.6 million bp, averaging 737 bp per read
- Reads could come from different individuals, different strains of the same species, or different species

Human microbiome projects
- To characterize the human microbiome (the totality of microbes living on and within the human body) and its role in health and disease
- An often-asked question: is there a core human microbiome?
- NIH HMP website: http://nihroadmap.nih.gov/hmp/

rRNA-based or metagenomic
- Small-subunit ribosomal RNA (rRNA) studies for microbial community profiling (involving PCR of 16S rRNAs)
  → 16S rRNA for Bacteria and Archaea; 18S rRNA for Eukaryotes
  → rRNAs are used as phylogenetic markers to define which lineages are present in a community
  → Barcoded pyrosequencing allows a deeper view of a microbial community
- Metagenomic studies for functional profiling: community DNA is subjected to shotgun sequencing
  → Commonly used sequencing techniques: 454 pyrosequencing and Solexa/Illumina
  → Metagenomic studies are usually more expensive than rRNA-based studies, but they are essential for understanding the functions encoded in a metagenome (collection of genomes)
Ref: Genome Res. 2009, 19:1141-1152

Computational problems and challenges
- Problems
  → Assembly
  → Identification of community species: phylotyping versus binning
  → Function annotation
  → Comparative analysis (e.g., UniFrac and SONS for comparing microbial communities)
- Challenges
  → Scale: development of computational tools that can handle input on “metagenomic” scale
  → Complexity: a metagenome contains genetic elements from various genomes (could be huge)

Genome- or gene-centric approaches
- Genome-centric analyses
  → Similar to traditional genome projects
  → Worked in the AMD project
      - A lucky pick: low species diversity
      - Reconstruction of near-complete genomes of Leptospirillum group II and Ferroplasma type II, and partial recovery of 3 other genomes; JAZZ was used
  → Does not work well on datasets from samples with high species diversity and/or low sequencing coverage
- Gene-centric analyses
  → Environmental gene tags (EGTs): short DNA sequences that contain fragments of functional genes
  → EGT “fingerprints” can be compared across multiple sites or habitats, or over time in the same environment
  → Overrepresented or underrepresented EGTs can provide insights into unique metabolic capabilities associated with a particular environment even if it is not possible to assign a particular EGT to a particular environment
Ref: Eisen JA. Environmental shotgun sequencing: its potential and challenges for studying the hidden world of microbes. PLoS Biol. 2007, 5(3):e82

Gene-centric approaches need to be improved
- Partial (fragmental) genes/proteins
  → Application of next-generation sequencing technologies (e.g., many metagenomic projects applied Roche/454 and Solexa/Illumina sequencing technologies for WGS, producing even shorter reads)
- Difficulties in analyzing fragmental genes
  → It is difficult to correctly predict partial genes from DNA fragments
  → And gene length does matter (Ref: Wommack et al. Appl Environ Microbiol. 2008 Mar;74(5):1453-63)

What’s covered in this talk
- MetaORFA: ORFome assembler
- MinPath: a parsimony approach to biological pathway reconstruction

MetaORFA: ORFome assembler
- ORFome: all the ORFs (open reading frames) found in a given set of DNA sequences
- ORFome assembly: assemble ORFs into longer peptides (so that similarity search using assembled peptides may achieve higher sensitivity and specificity)
- References
  → Accepted paper in CSB 2008; JBCB. 2009, 7(3):455-71
[Figure: an assembled peptide and 18 ORFs]

ORF assembly versus genome assembly
[Figure: whole genome assembly (assembly at DNA level) compared with ORFome assembly (ORF prediction, then assembly at protein level)]

Why assemble at the protein level?
- Whole genome assembly is difficult when reads are short, coverage is low, species complexity is high, or repeat-like DNA sequences (DNA elements shared among different species) are present
- Metagenomic DNA sequences could be from different individuals (so there will be mutations that further complicate DNA-level assembly)
- Many mutations (hopefully) are synonymous (they do not change the amino acid)
- We can assemble proteins first!

ORF identification
- The typical strategy “start from a start codon and stop at a stop codon” won’t work (because of the fragmental nature of the metagenomic sequences)
- We use all potential ORFs
- We masked the DNA sequences prior to ORF identification using MDUST and Tandem Repeats Finder
[Figure: potential ORFs across the forward and reverse reading frames]
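The "use all potential ORFs" idea can be sketched as follows. This is a minimal illustration, not the MetaORFA code: it assumes the standard genetic code, treats every stop-free stretch in each of the six reading frames as a potential ORF, and skips the repeat-masking step. The function names and the `min_len` parameter are hypothetical.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_orfs(read, min_len=2):
    """Return (frame, peptide) for each stop-free stretch of >= min_len residues."""
    read = read.upper()
    # Forward strand and its reverse complement.
    strands = {+1: read, -1: read.translate(COMPLEMENT)[::-1]}
    orfs = []
    for sign, seq in strands.items():
        for offset in range(3):
            frame = sign * (offset + 1)
            # No start codon is required: a fragment may begin mid-gene.
            for peptide in translate(seq[offset:]).split("*"):
                if len(peptide) >= min_len:
                    orfs.append((frame, peptide))
    return orfs


orfs = six_frame_orfs("ATGGCCTAAGGG")
```

Raising `min_len` corresponds to the ORF length cutoff: it discards short stretches that are unlikely to be real and would slow down the assembly.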

ORF assembly algorithm: Eulerian path approach
- Fragments = {ATG, TGC, GCG, CGT, GTG, GCT, CTG, GCA}
- Vertices correspond to (l–1)-mers: {AT, TG, GC, CG, GT, CT, CA}
- Edges correspond to l-mers from fragments (e.g., TGC; we used l = 10)
- Assembled sequences: a path that visits every EDGE, e.g., ATGCGTGCTGCA or ATGCTGCGTGCA
- Repetitive sequences are represented by a single edge (TGC)
[Figure: De Bruijn graph on the vertices AT, TG, GC, CG, GT, CT, CA]
Ref: Pevzner, Tang and Waterman (2001), An Eulerian path approach to DNA fragment assembly. PNAS 98:9748-9753
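The toy example on this slide can be reproduced with a short sketch. It is only an illustration under stated assumptions: l = 3 rather than the l = 10 used for peptides, a brute-force enumeration of Eulerian paths instead of a production assembler, and an edge multiplicity of 3 for the repeat TGC (the number of times TGC occurs in each assembled sequence shown). The function name `eulerian_paths` is hypothetical.

```python
from collections import Counter


def eulerian_paths(edge_counts):
    """Enumerate all sequences spelled by paths that use every edge
    exactly as many times as its multiplicity (toy-sized inputs only)."""
    edges = Counter(edge_counts)
    total = sum(edges.values())
    out_deg, in_deg = Counter(), Counter()
    for kmer, n in edges.items():
        out_deg[kmer[:-1]] += n   # prefix (l-1)-mer is the source vertex
        in_deg[kmer[1:]] += n     # suffix (l-1)-mer is the target vertex
    # A path (rather than a cycle) starts where out-degree exceeds in-degree.
    starts = [v for v in out_deg if out_deg[v] - in_deg[v] == 1]
    starts = starts or sorted(out_deg)
    results = set()

    def walk(node, spelled, used):
        if used == total:
            results.add(spelled)
            return
        for kmer in sorted(edges):
            if edges[kmer] > 0 and kmer[:-1] == node:
                edges[kmer] -= 1
                walk(kmer[1:], spelled + kmer[-1], used + 1)
                edges[kmer] += 1  # backtrack

    for start in starts:
        walk(start, start, 0)
    return sorted(results)


# l-mers from the slide; TGC is the repeat and is traversed three times.
multiplicity = {"ATG": 1, "TGC": 3, "GCG": 1, "CGT": 1,
                "GTG": 1, "GCT": 1, "CTG": 1, "GCA": 1}
assemblies = eulerian_paths(multiplicity)
```

Both assemblies use the same edges, so the graph alone cannot decide between them; the repeat edge TGC is what makes the order of the two loops ambiguous.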

ORFome assembly reports family graph
- Sequencing & ORF identification give peptides from reads: MLSDFPVS, FPVSTLI, TLIARCV, RCVLNSTY, MLSNFP, PVSTV, STVFAKTTL, TTLNST, LNSTY
- De Bruijn graph construction gives a protein family graph with edges MLSD, LIARCV, FPVST, LNSTY, MRSN, VFAKTT
- Protein I: MLSDFPVSTLIARCVLNSTY
- Protein II: MRSNFPVSTVFAKTTLNSTY
- Currently the sequences corresponding to the edges are used in the search!

MetaORFA
- Input: predicted ORFs
- Output: assembled peptides
- MetaORFA runs fast
  → but the downstream analysis of the ORFs/assembled peptides (similarity search and family annotation) may be time-consuming

Test of ORF length cutoff
- Short ORFs may not be real
- Too many short ORFs slow down the assembly

Test on real metagenomic datasets
- Four datasets, each containing metagenomic sequences of a major oceanic region community (Sargasso Sea, Coast of British Columbia, Gulf of Mexico, and Arctic Ocean), referred to as the Ocean Virus datasets
- The reads were produced on a 454 sequencing machine, and they are typically very short
- All the metagenomic sequences were downloaded from the CAMERA website (http://camera.calit2.net/)

ORFome assembly results
Table: Statistics of the ORFs and ORFome assembly results for Ocean Virus datasets (A-Pep: assembled peptide)

More reads hit similar sequences
- Total reads = 688590, searched against the IMG database (the Integrated Microbial Genomes system) version 2.4

More functional categories are identified
Table: Summary of the family annotation of assembled peptides versus unassembled reads for the four ocean virus datasets
- PTHR22748, AP endonuclease (E-value = 2.5e-12)
- PTHR11527 (subfamily SF15), heat shock protein 16 (E-value = 1.5e-07)
- PTHR21535 (subfamily SF1), magnesium and cobalt transport protein (E-value = 8e-09)
- PTHR17630 (subfamily SF20), carboxymethylenebutenolidase (E-value = 4.7e-08)
- PANTHER family classification was used for family (subfamily) annotation
- The PANTHER HMM library was downloaded from ftp://ftp.pantherdb.org and the associated HMM searching tool (pantherScore.pl) was used

Longer peptides carry more specific information
- For the Arctic Ocean dataset
- Assembled peptides add 113 subfamilies to the annotation using unassembled short ORFs (1524 subfamilies)
- An example with “mis-annotation” at subfamily level
[Figure: PTHR11935:SF11 (Glyoxalase II); reads SCUMS_READ_Arctic2924400-r2, SCUMS_READ_Arctic2876600-r1, SCUMS_READ_Arctic2285121-f2, SCUMS_READ_Arctic2455735-f0, SCUMS_READ_Arctic2538177-f0, SCUMS_READ_Arctic2813824-r18; PTHR11935:SF10 (Beta lactamase domain); assembled peptide]

Assembled peptides may involve synonymous mutations

New/future developments
- Improve assembly algorithm (A-Bruijn graph algorithm)
- Improve ORF identification
  → Current prediction may include many false ORFs
- Systematically study DNA polymorphism in metagenomic sequences
- Apply ORF assembly results to improve metagenomic sequence annotation
- Utilize ORF assembly results to assist DNA-level assembly

MinPath: a parsimony approach to biological pathway reconstruction for genomes and metagenomes

Biological pathways are key to the understanding of biological functions

Smaller units (e.g., KEGG pathways) are extremely important for the understanding of biological functions

Genome of an endosymbiont coupling N2 fixation to cellulolysis within protist cells in termite gut
Image from: http://www.sciencemag.org/cgi/content/full/322/5904/1108/DC1
Ref: Science 322(5904): 1108–1109, 2008

MinPath: a parsimony approach to biological pathway reconstruction
- The naïve mapping approach collects all pathways with one or more associated families annotated
- MinPath keeps only the minimal set of pathways that explain all the functions annotated
[Figure: families f1 to f6 mapped to pathways p1 to p4; naïve mapping keeps p1, p2, p3, and p4, whereas MinPath keeps p1, p2, and p4]
Reference: PLoS Computational Biology (to appear)

Why MinPath
- Pathway reconstruction based on new high-throughput techniques (e.g., proteomics and metagenomics) must draw conclusions from explicitly incomplete information (metagenomes, unlike genomes, are most likely incomplete). There will be missing enzymes in reconstructed pathways: are they really missing, or were they simply not sampled?
- Existing methods of pathway reconstruction or inference (e.g., the naïve mapping approach shown in the previous slide) may over-estimate the number of pathways because of redundancy in the protein-pathway mapping (e.g., different pathways may share the same biological functions, and some proteins carry out multiple biological functions).
- A more conservative pathway reconstruction (such as the minimal pathway formulation used in MinPath) may actually work better

Minimal pathway reconstruction problem solved by integer programming algorithm
- The goal is to find the minimum number of pathways that can explain all the functions carried by at least one protein from a dataset
- M is the mapping of protein functions to pathways, where M_ij = 1 if function i is involved in pathway j, and M_ij = 0 otherwise (note that one function may map to multiple pathways or subsystems)
- P_j denotes whether pathway j is selected in the final list: P_j = 1 if selected, P_j = 0 otherwise
- The set of pathways with P_j = 1 composes the minimal set of pathways that can explain all the functions that are annotated for a dataset
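Read as an optimization problem, the definitions above amount to a set cover: choose binary P_j so that the sum of P_j is minimal while, for every annotated function i, the sum over j of M_ij · P_j is at least 1. The sketch below illustrates that objective on a toy case. It is not the MinPath implementation: it swaps the integer programming solver for an exhaustive search that is only practical for a handful of pathways, and the function-to-pathway mapping is a made-up example chosen so that p3 is redundant.

```python
from itertools import combinations


def naive_pathways(annotated, pathway_functions):
    """Naive mapping: every pathway with at least one annotated function."""
    return sorted(p for p, fs in pathway_functions.items() if fs & annotated)


def minimal_pathways(annotated, pathway_functions):
    """Smallest set of pathways that covers every annotated function
    belonging to at least one pathway (exhaustive search)."""
    candidates = naive_pathways(annotated, pathway_functions)
    coverable = set().union(*(pathway_functions[p] for p in candidates)) & annotated
    # Try subsets in order of increasing size; the first full cover is minimal.
    for size in range(len(candidates) + 1):
        for chosen in combinations(candidates, size):
            covered = set().union(*(pathway_functions[p] for p in chosen))
            if coverable <= covered:
                return list(chosen)
    return candidates


# Hypothetical mapping: p3 shares all of its functions with p1 and p2.
pathway_functions = {
    "p1": {"f1", "f2"},
    "p2": {"f3", "f4"},
    "p3": {"f2", "f3"},
    "p4": {"f5", "f6"},
}
annotated = {"f1", "f2", "f3", "f4", "f5", "f6"}

naive = naive_pathways(annotated, pathway_functions)
minimal = minimal_pathways(annotated, pathway_functions)
```

In this toy case the naive approach reports four pathways, while the parsimonious answer drops p3 because none of its functions is unique to it.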

Protein function and function annotation
- K numbers or fig families are used for functions, depending on which pathway database is used
  → K numbers (or KO families) for KEGG pathways
  → fig families for SEED subsystems
- Function annotations are based on similarity search
  → The fig family release comes with a script for fig family annotation (the fig annotations used in the MinPath paper were downloaded from the MG-RAST server)
  → KO families were downloaded from the KEGG server, or predicted based on the best BLAST hits with an E-value cutoff of 1e-5
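The "best BLAST hit with an E-value cutoff of 1e-5" rule can be sketched as below. This is an assumption-laden illustration rather than the pipeline used for the talk: it assumes BLAST tabular output with the twelve standard columns (query, subject, ..., E-value in column 11, bit score in column 12), picks the highest bit score per query among hits that pass the cutoff, and uses a hypothetical `gene_to_ko` lookup with placeholder gene names and K numbers.

```python
def best_hits(blast_tabular, max_evalue=1e-5):
    """Best subject per query (highest bit score) among hits passing the cutoff."""
    best = {}
    for line in blast_tabular.strip().splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def assign_ko(hits, gene_to_ko):
    """Transfer the KO family of the best-hit gene to each query."""
    return {q: gene_to_ko[s] for q, s in hits.items() if s in gene_to_ko}


# Made-up hits: read2 has only a weak hit and stays unannotated.
rows = [
    ("read1", "geneA", "90.0", "50", "5", "0", "1", "50", "1", "50", "1e-20", "95.0"),
    ("read1", "geneB", "70.0", "50", "15", "0", "1", "50", "1", "50", "1e-8", "60.0"),
    ("read2", "geneC", "40.0", "30", "18", "0", "1", "30", "1", "30", "0.5", "25.0"),
]
table = "\n".join("\t".join(row) for row in rows)
gene_to_ko = {"geneA": "K00001", "geneB": "K00002", "geneC": "K00003"}  # placeholders

annotation = assign_ko(best_hits(table), gene_to_ko)
```

The resulting query-to-KO dictionary is the kind of function list that a pathway reconstruction step would take as input.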

Pathway reconstructions of genomes by MinPath
- MinPath gives an estimate of the functional diversity of various genomes (measured by the number of pathways constructed) that is closer to the curated KEGG database than the estimate given by the naïve mapping approach

Selected pathways eliminated by MinPath (Human genome)

An example of a pathway eliminated by MinPath: the ascorbate and aldarate metabolism pathway
- Only three enzymes are annotated in the human genome (highlighted in green), none of which is unique to this pathway.

Pathway reconstructions by MinPath for metagenomes
- In a single sequence set from the coral biome (4440319.3.dna.fa), the naïve mapping approach identified 224 KEGG pathways, whereas MinPath identified only 143 KEGG pathways.
- The pathways eliminated by MinPath include the inositol metabolism pathway, the androgen and estrogen metabolism pathway, and the caffeine metabolism pathway.
- The metagenomes used here are from Dinsdale EA et al. 2008, Nature 452: 629-632

Potential applications of MinPath
- To improve function annotation of metagenomic sequences with more carefully constructed biological pathways
- To give a more reliable estimation of the functional ability of a community, which is essential for understanding a community, and for comparing the functional diversity of different communities

Acknowledgements
- Yu-Wei Wu, Tom Doak, Mina Rho, and Quan Zhang
- Colleagues at the School of Informatics and Computing, Indiana University, Bloomington
  → Drs. Haixu Tang, Sun Kim, Mehmet Dalkilic, Matthew Hahn and Predrag Radivojac
- NIH grant 1R01HG004908-01 (Fragment Assembly and Metabolic/Species Diversity Analysis for HMP)
- MetaCyt Initiative at Indiana University, funded by Lilly Endowment

The game has just begun
