Presentation on theme: "The Microbiome and Metagenomics"— Presentation transcript:
1 The Microbiome and Metagenomics. Catherine Lozupone, CPBS 7711, September 19, 2013.
2 What is the microbiome? “The ecological community of commensal, symbiotic, and pathogenic microorganisms that share our body space.” Microbiota: “collection of organisms.” Microbiome: “collection of genes.” Members: Bacteria, Archaea, microbial eukaryotes (e.g. fungi or protists), and viruses. Body sites with important roles in health and disease: gut, mouth, vagina, skin (diverse sites: nasal epithelial). Body sites with important roles in disease: lung, blood, liver, urine.
3 The big tree. The majority of life’s diversity is microbial. The majority of microbial life cannot be grown in pure culture. Microbes have vastly different metabolic capabilities and tolerances to temperature, depth, and salinity. Pace, N.R., The Universal Nature of Biochemistry. PNAS Vol 98(3) pp
4 The Human Gut Microbiota. 100 trillion microbial cells: outnumber human cells 10 to 1! Most gut microbes are harmless or beneficial: they protect against enteropathogens, extract dietary calories and vitamins, and prevent immune disorders. The list of diseases associated with dysbiosis is ever growing. Inflammatory diseases: IBD, IBS. Metabolic diseases: obesity, malnutrition. Neurological disorders. Cancer.
6 What do we want to understand? What does a healthy microbiome look like? How diverse is it? What types of bacteria are there? What is their function? How variable is the microbiome: over time within an individual, across individuals, functionally? What are the driving factors of variability? Age, culture, physiological state (pregnancy). How do changes affect disease? What properties (taxa, amount of diversity) change with disease? Cause or effect? Functional consequences of dysbiosis. Host interactions: evolution/adaptation to the host over time; immune system.
7 Culture-independent studies revolutionized our understanding of gut bacteria. Culture-based studies over-emphasized the importance of easily culturable organisms (e.g. E. coli). Culture-independent surveys: 1. Extract DNA from environmental samples. 2. PCR amplify the SSU rRNA gene (which species?) or sequence random fragments (which function?). 3. Evaluate sequences.
8 Gut microbiota has simple composition at the phylum level. Different phyla: animals and plants. Data from: Yatsunenko et al. Nature.
9 Diversity of Firmicutes in 2 healthy adults. Each person harbors > 1000 species. Some species are unique (red and blue). Some are shared (purple). We know very little about what most of these species do! 1.8 million sequences per person.
10 Sequencing technology renaissance enabled more complex study designs. Sanger sequencing (thousands); pyrosequencing (millions); Illumina (billions!). Figure labels: Illumina (100 million); pyrosequencing.
11 Metagenomics: the study of metagenomes, genetic material recovered directly from environmental samples. Marker gene: PCR amplify a gene of interest. Tells you what types of organisms are there. Bacteria/Archaea (16S rRNA), microbial eukaryotes (18S rRNA), fungi (ITS), viruses (no good marker). Shotgun: fragment DNA and sequence randomly. Tells you what kinds of functions are there.
12 Small Subunit Ribosomal RNA. Present in all known life forms. Highly conserved. Resistant to horizontal transfer events. Figure: 16S rRNA secondary structure.
13 Other ‘Omics. MetaTranscriptomics (sequence version of microarray): isolate all RNA, deplete rRNA, sequence all transcripts. Sometimes a phenotype is only seen in the activity of the microbiota. Metabolomics: what metabolites does a community produce? E.g. in feces or urine. MetaProteomics: what proteins does a community produce?
14 Integrating Data Types. 16S rRNA -> shotgun metagenomics: what gene differences cannot be explained by 16S? Selection by HGT. 16S/genomics -> transcriptomics -> metabolomics: what species or genes (or combination of species or genes), when expressed, are responsible for producing a given metabolite?
18 Short reads (pyrosequencing) can recapture the result. Unweighted (UW) UniFrac clustering with Arb parsimony insertion of 100 bp reads extending from primer R357. Assignment of short reads to an existing phylogeny (e.g. the greengenes coreset) allows for the analysis of very large datasets. Liu Z, Lozupone C, Hamady M, Bushman FD & Knight R (2007) Short pyrosequencing reads suffice for accurate microbial community analysis. Nucleic Acids Res 35: e120.
19 My presentation is going to cover analysis of data with QIIME. This shows many of the steps within QIIME. I am going to discuss certain steps in some detail, and cover the workflow scripts that automate many of the internal steps.
20 Preprocessing pyrosequencing datasets. Quality filtering: discard sequences that are too short or too long ( range), have low quality scores, or have long homopolymers. Can trim poor quality regions from the ends. PyroNoise and chimeras: noise and chimeras can greatly inflate OTU counts; the PyroNoise algorithm uses SFF files to fix noisy sequences. Use barcodes to assign sequences to samples.
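The quality filters and the barcode step on this slide can be sketched in a few lines of Python. This is only an illustration and not QIIME's implementation: the thresholds, the function names, and the exact-prefix barcode match are all assumptions, since the slide does not give the actual length range or cutoffs.

```python
import re

# Hypothetical thresholds for illustration; the slide does not state the real values.
MIN_LEN, MAX_LEN = 200, 1000
MIN_MEAN_QUAL = 25
MAX_HOMOPOLYMER = 6


def passes_quality_filter(seq, quals):
    """Return True if a read passes the length, quality, and homopolymer checks.

    Assumes `quals` holds one quality score per base of `seq`.
    """
    if not seq or not (MIN_LEN <= len(seq) <= MAX_LEN):
        return False
    if sum(quals) / len(quals) < MIN_MEAN_QUAL:
        return False
    # Longest run of a single repeated base, e.g. "AAAAAAA" is a run of 7.
    longest_run = max(len(m.group(0)) for m in re.finditer(r"(.)\1*", seq))
    return longest_run <= MAX_HOMOPOLYMER


def demultiplex(reads, barcode_to_sample):
    """Assign each read to a sample by its leading barcode and strip the barcode.

    Reads whose prefix matches no barcode are dropped.
    """
    by_sample = {}
    for seq in reads:
        for barcode, sample in barcode_to_sample.items():
            if seq.startswith(barcode):
                by_sample.setdefault(sample, []).append(seq[len(barcode):])
                break
    return by_sample
```

Denoising and chimera removal are deliberately left out; they need flowgram data or reference comparisons that a short sketch cannot represent.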
21 Defining species: OTU picking. Cluster sequences based on % identity; 97% ID is typical for species. Tools: CD-HIT, UCLUST. Phylogenetic diversity measures need a tree. Align sequences: NAST, PyNAST. De novo tree building: FastTree. Alternatively, assign reads to sequences in a pre-defined reference tree.
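To make the idea of clustering at 97% identity concrete, here is a toy greedy centroid clustering in Python. It is a sketch in the spirit of centroid-based OTU pickers, not the CD-HIT or UCLUST algorithm: it assumes the reads are already aligned to equal length and scores identity as the fraction of matching positions, and the function names are hypothetical.

```python
def percent_identity(a, b):
    """Fraction of matching positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def pick_otus(seqs, threshold=0.97):
    """Greedy clustering: a read joins the first centroid it matches at or above
    the threshold; otherwise it becomes the centroid of a new OTU.

    `seqs` is a list of (read_id, sequence) pairs. Returns a list of OTUs,
    each a list of read ids.
    """
    centroids = []
    otus = []
    for seq_id, seq in seqs:
        for i, centroid in enumerate(centroids):
            if percent_identity(seq, centroid) >= threshold:
                otus[i].append(seq_id)
                break
        else:
            centroids.append(seq)
            otus.append([seq_id])
    return otus
```

Because the result depends on input order, real tools usually sort reads (for example by length or abundance) before clustering; varying `threshold` to 0.95 or 0.90 gives broader OTUs.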
22 Comparing Diversity. Overview of methods for evaluating/comparing microbial diversity across samples using 16S rRNA. Alpha diversity: how much is there? Beta diversity: how much is shared? Phylogenetic versus taxon-based diversity. Quantitative versus qualitative diversity. What types of taxa are driving the patterns? Which species are associated with measured properties? Tools: UniFrac/QIIME/Topiary Explorer. Lozupone, C.A. and R. Knight (2008) Species divergence and the measurement of microbial diversity. FEMS Microbiol Rev
23 How do we describe and compare diversity? Alpha diversity: “How many species are in a sample?” (e.g. 6 colors in A and 6 in B). E.g.: Are polluted environments less diverse than pristine ones? Beta diversity: “How many species are shared between samples?” (e.g. 2 shared colors between A and B). E.g.: Does the microbiota differ with different disease states? Figure: samples A and B.
24 Quantitative versus qualitative measures. Qualitative: considers presence/absence only. Alpha: How many species are in a sample? E.g.: 6 colors in both A and B. Beta: How many species are shared between samples? E.g.: A and B are identical because the same colors are present in both. Quantitative: also considers relative abundance. Alpha: accounts for “evenness”: e.g. B, where the population is evenly distributed across the 6 species, is more diverse than A, where all species are present but red dominates. Beta: samples will be considered more similar if the same species are numerically dominant versus rare. E.g. B and A no longer look identical because of differences in abundance.
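The A versus B example on this slide can be reproduced with a few taxon-based measures. The sketch below uses made-up counts for six "colors" and hypothetical function names. Richness and the Shannon index are the alpha measures; Jaccard distance is the qualitative beta measure. Bray-Curtis on relative abundances is used as the quantitative beta measure purely for illustration; it is not named in the slides.

```python
import math


def richness(counts):
    """Qualitative alpha diversity: number of taxa observed."""
    return sum(1 for c in counts.values() if c > 0)


def shannon(counts):
    """Quantitative alpha diversity: Shannon index (natural log)."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)


def jaccard_distance(a, b):
    """Qualitative beta diversity: 1 - shared taxa / all taxa seen in either sample."""
    taxa_a = {t for t, c in a.items() if c > 0}
    taxa_b = {t for t, c in b.items() if c > 0}
    union = taxa_a | taxa_b
    return 1 - len(taxa_a & taxa_b) / len(union) if union else 0.0


def bray_curtis(a, b):
    """Quantitative beta diversity on relative abundances (stand-in measure)."""
    total_a, total_b = sum(a.values()), sum(b.values())
    taxa = set(a) | set(b)
    shared = sum(min(a.get(t, 0) / total_a, b.get(t, 0) / total_b) for t in taxa)
    return 1 - shared


# Made-up counts: A is dominated by red, B is perfectly even.
A = {"red": 25, "blue": 1, "green": 1, "yellow": 1, "orange": 1, "purple": 1}
B = {"red": 5, "blue": 5, "green": 5, "yellow": 5, "orange": 5, "purple": 5}
```

With these counts both samples have richness 6 and a Jaccard distance of 0, yet B has the higher Shannon index and the Bray-Curtis distance between them is greater than 0, which is the contrast the slide describes.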
25 What is a phylogenetic diversity measure? Alpha diversity. Taxon: “How many species are in a sample?” Phylogenetic: “How much phylogenetic divergence is in a sample?” (e.g. B is more individually diverse than A: more divergent colors). Beta diversity. Taxon: “How many species are shared between samples?” Phylogenetic: “How much phylogenetic distance is shared between samples?” (only related colors from B are in A).
26 Advantages of phylogenetic techniques. Phylogenetically related organisms are more likely to have similar roles in a community. Taxon-based methods assume a “star phylogeny,” where all relationships between taxa are ignored. Phylogeny- and taxon-based methods can be complementary.
27 Diversity Measures. Alpha diversity. Phylogenetic: Phylogenetic Diversity (PD). Taxon-based: observed # species (richness); correct for undersampling (Chao1, ACE); richness + evenness (Shannon-Weaver index). Beta diversity. Test if samples have significantly different membership: UniFrac Significance, P test, Libshuff (phylogenetic). Identify environmental variables associated with differences between many samples. Phylogenetic: unweighted and weighted UniFrac, DPCoA. Taxon-based: Jaccard/Sorenson indices.
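As one example of correcting richness for undersampling, Chao1 adds an estimate of unseen taxa based on how many taxa were seen exactly once (singletons) and exactly twice (doubletons). The sketch below assumes the bias-corrected form of the estimator, which stays defined when there are no doubletons; the function name and input layout are illustrative.

```python
def chao1(counts):
    """Bias-corrected Chao1 richness estimate.

    `counts` is a list of per-taxon sequence counts for one sample.
    S_chao1 = S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)),
    where F1 and F2 are the numbers of singleton and doubleton taxa.
    """
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))
```

For counts `[10, 5, 2, 2, 1, 1, 1]` there are 7 observed taxa, 3 singletons, and 2 doubletons, so the estimate is 7 + (3 × 2) / (2 × 3) = 8. A sample with no singletons returns its observed richness unchanged.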
28 Phylogenetic Diversity (PD). Sum of branches leading to sequences in a sample. The sample with taxa spanning the most branch length in this tree represents the most phylogenetically and perhaps functionally divergent community. Faith, D.P. (1992) Conservation evaluation and phylogenetic diversity. Biological Conservation 61, 1-10.
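A minimal sketch of the PD calculation, assuming a rooted tree stored as a child-to-(parent, branch length) dictionary. The toy tree, node names, and function names are made up for illustration, and the sum here includes the branches back to the root.

```python
# Toy rooted tree: node -> (parent, length of the branch leading to the node).
TREE = {
    "X": ("root", 1.0),
    "Y": ("root", 2.0),
    "a": ("X", 0.5),
    "b": ("X", 0.5),
    "c": ("Y", 1.0),
    "d": ("Y", 1.5),
}


def covered_branches(tips, tree):
    """Nodes whose parent branch lies on a path from any of the tips to the root."""
    covered = set()
    for tip in tips:
        node = tip
        # Walk toward the root, stopping early on branches already counted.
        while node in tree and node not in covered:
            covered.add(node)
            node = tree[node][0]
    return covered


def faith_pd(tips, tree):
    """Sum of the branch lengths leading to the sequences in a sample."""
    return sum(tree[node][1] for node in covered_branches(tips, tree))
```

In this tree a sample containing the sister tips `a` and `b` has PD 2.0, while a sample containing `a` and `c`, which sit on opposite sides of the root, has PD 4.5: the same number of taxa but more divergence.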
29 PD Rarefaction. Plot the amount of branch length against the # of observations. The shape of the curve allows for estimating how far we are from sampling all of the phylogenetic diversity. Allows for comparison of phylogenetic diversity between samples. Eckburg, P.B., et al. (2005) Diversity of the human intestinal microbial flora. Science 308,
30 Phylogenetic and OTU-based techniques can be complementary. Results of analyzing the same data with Chao1 and PD. Samples from stool, mouth, lung, plasma, and negative controls. Differentiation between the stool/mouth and negative controls is greater with Chao1 than with PD. The negative controls have few OTUs but they are phylogenetically diverse. Chao1 estimates go up with sampling effort.
31 Phylogenetic beta diversity: How is diversity partitioned across samples? Do two samples contain significantly different microbial populations? Can we see broad trends that relate many samples and explain them in terms of environmental factors?
32 Unique Fraction (UniFrac) metric. Qualitative phylogenetic beta diversity. Distance = fraction of the total branch length that is unique to any particular environment. Lozupone and Knight, 2005, Appl Environ Microbiol 71:8228
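The distance definition on this slide translates directly into code. The sketch below assumes a rooted tree stored as a child-to-(parent, branch length) dictionary and presence/absence data only; the toy tree and all names are illustrative, not the published implementation.

```python
# Toy rooted tree: node -> (parent, length of the branch leading to the node).
TREE = {
    "X": ("root", 1.0),
    "Y": ("root", 2.0),
    "a": ("X", 0.5),
    "b": ("X", 0.5),
    "c": ("Y", 1.0),
    "d": ("Y", 1.5),
}


def covered_branches(tips, tree):
    """Nodes whose parent branch leads to at least one of the given tips."""
    covered = set()
    for tip in tips:
        node = tip
        while node in tree and node not in covered:
            covered.add(node)
            node = tree[node][0]
    return covered


def unweighted_unifrac(tips_a, tips_b, tree):
    """Branch length unique to one environment divided by the total branch
    length covered by either environment."""
    a = covered_branches(tips_a, tree)
    b = covered_branches(tips_b, tree)
    total = sum(tree[n][1] for n in a | b)
    unique = sum(tree[n][1] for n in a ^ b)
    return unique / total if total else 0.0
```

Two samples drawn from opposite sides of the root (`{a, b}` versus `{c, d}`) share no branches and get the maximum distance of 1.0, while `{a, c}` versus `{b, c}` differ only in two short terminal branches and get 1.0 / 5.0 = 0.2.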
33 Clustering with the UniFrac Algorithm. Can we see broad trends that relate many samples and explain them in terms of environmental factors?
34 What types of environments have similar phylogenetic beta diversity? Figure labels: 1-12; temperature 0-100°C; nutrient availability, oligotrophic to eutrophic; pressure 1-200 atm. Lozupone CA & Knight R (2007) Global patterns in bacterial diversity. Proc Natl Acad Sci U S A 104:
35 Salinity is the most important factor. Figure: PCoA of UniFrac distance matrix.
36 Hierarchical clustering (UPGMA) of the same UniFrac distance matrix.
37 Qualitative vs quantitative measures of phylogenetic diversity. Qualitative: unweighted UniFrac. Detects factors restrictive for microbial growth: high temperature, low pH, founder effects. Quantitative: weighted UniFrac, DPCoA. Detects transient changes: seasonal changes, nutrient availability, response to pollution. The two yield different, complementary results, and applying both to the same data can provide insight into the nature of community changes.
38 Weighted UniFrac. Figure panels: qualitative, quantitative. Lozupone et al., 2007. Appl Environ Microbiol 73:1576
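For comparison with the qualitative measure, weighted UniFrac weights each branch by the difference in the fraction of each sample's sequences that descend from it. The sketch below computes the raw (unnormalized) value on a made-up tree; the data layout and names are assumptions, and samples are assumed to be non-empty.

```python
# Toy rooted tree: node -> (parent, length of the branch leading to the node).
TREE = {
    "X": ("root", 1.0),
    "Y": ("root", 2.0),
    "a": ("X", 0.5),
    "b": ("X", 0.5),
    "c": ("Y", 1.0),
    "d": ("Y", 1.5),
}


def descendant_fractions(counts, tree):
    """For each branch (keyed by its child node), the fraction of the sample's
    sequences that descend from it. `counts` maps tip name -> sequence count."""
    total = sum(counts.values())
    fractions = {}
    for tip, n in counts.items():
        node = tip
        while node in tree:
            fractions[node] = fractions.get(node, 0.0) + n / total
            node = tree[node][0]
    return fractions


def weighted_unifrac(counts_a, counts_b, tree):
    """Raw weighted UniFrac: sum over branches of
    branch length * |fraction in A - fraction in B|."""
    fa = descendant_fractions(counts_a, tree)
    fb = descendant_fractions(counts_b, tree)
    return sum(
        length * abs(fa.get(node, 0.0) - fb.get(node, 0.0))
        for node, (_parent, length) in tree.items()
    )
```

Samples `{a: 9, c: 1}` and `{a: 1, c: 9}` contain the same tips, so a presence/absence measure sees no difference between them, but the weighted value is 3.6 because the dominant lineage has switched.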
39 Obesity and Gut Microbiota. Mice heterozygous for a mutation in the leptin gene were interbred. The 16S gene was sequenced for bacteria in the gut of mothers and offspring. Ley et al. (2005) Obesity Alters Gut Microbiota, PNAS Vol 102: pp
40 So how about the obese mice? Mice cluster perfectly by mother. Ley et al. (2005) Obesity Alters Gut Microbiota, PNAS Vol 102: pp
41 Stronger clustering with obesity with Weighted UniFrac.
42 Comparison of human stool and mucosal microbes. Unweighted UniFrac: all samples cluster by individual. Weighted UniFrac: stool looks different. Eckburg, P.B., et al. (2005) Diversity of the human intestinal microbial flora. Science 308,
43 Measures in the same class cluster the data similarly. Double principal coordinates analysis (DPCoA): another quantitative beta diversity measure. A matrix of species distances is first used to ordinate the species using PCoA. The position of the communities in coordinate space is the average position of the species that they contain, weighted by relative abundances. Produces the same results as weighted UniFrac.
44 Fast UniFrac. Computational enhancements give order-of-magnitude increases in speed and reduced memory requirements. Hamady, Lozupone and Knight, The ISME Journal, Epub ahead of print.
45 Avoiding bias. Pyrosequencing often produces high variability in the number of sequences per sample. This can introduce bias because undersampling creates inflated beta diversity values. We randomly resampled a dataset at different depths and calculated the average UniFrac distance: samples with fewer sequences look artificially different. Rarefaction: randomly select an even number of sequences from each sample. Lozupone et al ISME. 5:169-72
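Rarefaction to an even depth can be sketched as random subsampling without replacement. The function name, the dictionary layout, and the choice to return `None` for samples that are too shallow are assumptions made for this example.

```python
import random


def rarefy(counts, depth, rng=None):
    """Randomly draw `depth` sequences without replacement from one sample.

    `counts` maps OTU id -> number of sequences. Returns a new count
    dictionary, or None if the sample has fewer than `depth` sequences
    (such samples are typically dropped from the analysis).
    """
    rng = rng or random.Random()
    # Expand the counts into one entry per sequence, then subsample.
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    if len(pool) < depth:
        return None
    rarefied = {}
    for otu in rng.sample(pool, depth):
        rarefied[otu] = rarefied.get(otu, 0) + 1
    return rarefied
```

Applying the same `depth` to every sample before computing UniFrac removes the sequencing-depth differences described above; passing a seeded `random.Random` makes the draw reproducible.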
46 Web interfaces have >2200 registered users. UniFrac papers have collectively 1250 citations. Figure label: 461 citations.
50 Supervised learning, classical statistics, taxonomic classification, and phylogenetic trees: how can we use these tools to understand which microbial taxa change across treatments?
51 Identifying compositional changes that drive diversity patterns. Histograms.
52 Histograms and trees can paint a different picture. Firmicutes: 16S rRNA gene tree of OTUs prevalent in 2 studies of diet/obesity (Turnbaugh 2009 Sci Transl Med. 1:6ra14; Ley Nature. 444:1022-3). Clostridia clusters XIVa and IV are the most abundant in the healthy gut (Peterson 2008 Cell Host Microbe 3:417-27). Cluster XIVa is ~43% of the total bacteria in the stool of healthy individuals (Maukonen J Med Microbiol. 55: ).
53 Identifying taxonomic determinants. Which taxa are significantly different between health and disease? Using OTUs versus classifier-derived taxa. PCoA biplots: which taxa are correlated with overall clustering patterns? Finding discriminatory OTUs with supervised learning. Applying classical statistical tests with otu_category_significance.py. Exploring relationships in trees.
54 Defining Taxa. 2 methods: OTUs and classifiers (e.g. the RDP classifier). For both methods the phylogenetic depth of the taxa can be varied. OTUs: different % IDs (97%, 95%, 90%). Classifiers: different levels (species, genus, family). Advantages of using OTUs: can evaluate phylotypes not related to known species or in taxonomic groups with poorly defined systematics; each OTU represents an equal amount of phylogenetic divergence. Advantages of using classifiers: can more easily relate results to other published results; fewer taxa than OTUs.
55 At what level should I classify? Shallow: 97% ID OTUs or species-level taxonomy assignments. Advantage: biological properties of taxa have the potential to be more strictly defined. Disadvantage: can lose power to find associations in broader lineages in which a trait is conserved. Broad: 90% ID OTUs or family-level taxonomic assignments. Advantage: more powerful for conserved traits. Disadvantage: an association in a broader group is often driven by only a subset of its members (i.e. if you detect that Gammaproteobacteria go up you cannot say that E. coli did it!).
56 When ill-defined systematics can cause trouble. Figure: tree of Clostridium cluster XIVa (Lachnospiraceae) with tip labels from the genera Clostridium, Ruminococcus, Blautia, and Eubacterium repeated across the tree. Lozupone et al 2012 Genome Research.
57 PCoA biplots. Allow visualization of taxa and samples in the same PCoA space.
58 Finding discriminative OTUs. 2 methods: supervised learning and classical statistics. Supervised learning evaluates how well OTUs/taxa can be used to classify by treatment. Discriminative OTUs are those for which classification power is reduced when they are removed from the set. Advantage: evaluates OTUs contextually rather than independently. Disadvantage: only works with discrete sample groupings (i.e. will not handle correlations with disease severity or changes over time).
59 Feature importance scores. All OTUs with scores > considered ‘important’ (Yatsunenko et al Nature 2012). Problem: we do not know the direction of change. With only two categories: compare the means.
60 Classical statistics tests in QIIME: otu_category_significance.py. Options: i: OTU table; m: category mapping; c: category (e.g. health status); s: statistical test (ANOVA, Pearson correlation, paired t test, G-test of independence); f: minimum number of samples an OTU must be found in to be considered. Removes OTUs that don’t pass the filter, performs a statistical test on each OTU, and corrects for multiple comparisons with FDR and Bonferroni correction. Can also be run on taxa summary table files if in BIOM format.
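The multiple-comparison step matters because one test is run per OTU. The sketch below shows the two kinds of correction named on this slide applied to a list of raw p-values. The slide only says "FDR"; the Benjamini-Hochberg procedure is assumed here for illustration and may differ in detail from what the QIIME script computes. The function names are hypothetical.

```python
def bonferroni(pvals):
    """Bonferroni correction: multiply each p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg FDR-adjusted p-values (assumed FDR variant).

    Each p-value is scaled by m / rank, then adjusted values are made
    monotone by taking a running minimum from the largest p-value down.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

For raw p-values `[0.01, 0.04, 0.03, 0.5]`, Bonferroni leaves only the first OTU below 0.05, while the FDR adjustment is less conservative, which is why both columns are worth inspecting.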
61 Assign statistical significance values to bar charts.
62 ANOVA output. I use these means and their significance to assess direction of change in supervised learning results.
64 Are discriminatory OTUs related to each other and to type strains? Relate them in a tree. Use ARB to make the tree using parsimony insertion. Use Topiary Explorer to visualize/color the tree and make publication-quality graphics.
66 Sometimes associations are phylogenetically shallow. Erysipelotrichales with HIV infection.
67 Genomics. Thousands of complete and draft genome sequences for human commensals are publicly available. Promise: translate 16S into functional predictions (PICRUSt). Challenges: no genomes for unculturable microbes; genes with high HGT. Figure labels: distribution (16S rRNA), comparative genomics (complete genomes), experimental confirmation (anaerobic culture).
68 Annotating genes to functions. Based on similarity to genes of known function. NCBI genomes have functions listed for predicted proteins.
69 Databases for functional assignments. COGs (Clusters of Orthologous Groups); KEGG (Kyoto Encyclopedia of Genes and Genomes); CAZy (Carbohydrate-Active enZymes database); Pfam (protein family database).
70 COG database. Orthologous groups: a group of proteins that are expected to perform the same function in the different organisms in which they are found. Function is inferred for the whole group based on experimental work with one of its members. COGs are grouped into larger functional groups.
71 KEGG database. Orthologous groups (assigned KO numbers). Metabolic pathways: boxes contain Enzyme Commission (EC) numbers. Each EC is associated with KO numbers (a protein family that is known to perform that reaction).
74 Database describing protein families predicted to be carbohydrate active based on homology. Uses HMMs. Exact reaction performed does not need to be known. Glycoside Hydrolases (GH): degradation; hydrolyze glycosidic bonds between two carbs or between a carb and a non-carb. Important for degradation of plant polysaccharides. GlycosylTransferases (GT): biosynthesis; catalyze the transfer of sugar moieties. Important for communication with host immune system.
75 Similar to CAZy but with a broader scope. Hidden Markov Models that describe sequence motifs of a known function.
76 Annotating genes to taxonomic groups. Based on similarity to genes in a database of reference genomes. MG-RAST uses best BLAST hit: M5N4
77 Annotating metagenomes. MG-RAST produces a table mapping samples to annotations that can be further processed in QIIME.