greengenes.lbl.gov 16S rRNA gene database and workbench compatible with ARB Todd DeSantis, Phil Hugenholtz, Niels Larson, Igor Dubosarskiy, Jordan Moberg, Yvette Piceno, Ingrid Zubieta, Eoin Brodie, Gary Andersen LBL - JGI

Andersen Group Program Aims Creating a microarray for the simultaneous differentiation and quantification of closely related prokaryotes in complex samples.

The Biomarker: 16S rDNA. Identify and classify organisms by gene sequence variations. [Figure labels: 16S rDNA; rRNA (functional molecule); LSU; SSU]

The Challenges: The 16S sequence deposit rate is increasing. Many sequences are mis-annotated and/or chimeric. Sequence taxonomy updates lag years behind sequence availability (“Bacteria, Unclassified”). It is difficult to create and manage MSAs of all 16S sequence data (or even thousands of sequences) using Clustal/BioEdit/Arb. Probe quality is reliant on excellent MSAs and taxonomy. “Signatures” can erode as more sequences are discovered.

greengenes.lbl.gov

greengenes.lbl.gov Stay current. Source: http://www.ncbi.nlm.nih.gov/ Query: ‘16S NOT 1.16S NOT mitochondr* NOT 18S’

greengenes.lbl.gov Verify ‘16S-ness’. Fate of NCBI records: short FASTA file (9%); short BLAST match length (8%); BLAST match to 18S/Mito SSU (1%); odd nt insertions (1%); passed (81%).

NAST align step 1: find template. Hand-curated MSA provided by Phil. The alignment "template" is the top BLAST HSP (q = -1, favors long match). The candidate is trimmed of extra-16S sequence data (tRNA, intergenic spacer regions, and 23S rDNA) based on the HSP boundaries. If the HSP paired opposite strands, the candidate is reverse complemented.
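The trimming and strand handling in this step can be sketched in Python. This is a minimal illustration under stated assumptions, not the NAST code itself: the `Hsp` record and its field names are hypothetical, coordinates are assumed to be 1-based and inclusive on the candidate, and strand pairing is passed in as an explicit flag rather than parsed from BLAST output.

```python
from dataclasses import dataclass

# Complement table for DNA/RNA letters; other symbols (e.g. N) pass through unchanged.
_COMPLEMENT = str.maketrans("ACGTUacgtu", "TGCAAtgcaa")


@dataclass
class Hsp:
    """Hypothetical summary of the top BLAST HSP for one candidate."""
    cand_start: int    # 1-based, inclusive start of the match on the candidate
    cand_end: int      # 1-based, inclusive end of the match on the candidate
    same_strand: bool  # False when candidate and template paired on opposite strands


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a nucleotide string."""
    return seq.translate(_COMPLEMENT)[::-1]


def trim_and_orient(candidate: str, hsp: Hsp) -> str:
    """Keep only the HSP-covered part of the candidate, in template orientation."""
    lo, hi = sorted((hsp.cand_start, hsp.cand_end))
    trimmed = candidate[lo - 1:hi]  # drops flanking tRNA, spacer, or 23S sequence
    return trimmed if hsp.same_strand else reverse_complement(trimmed)
```

For example, `trim_and_orient("TTTACGGTCAAA", Hsp(4, 9, True))` keeps only the six matched bases, and the same call with `same_strand=False` returns their reverse complement.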

NAST align step 2: gap removal
Preserves global MSA positions (columns) by allowing local misalignments.

DEFINE
  St = post-Align0 template sequence.
  Sc = post-Align0 candidate sequence.
  Ht = alignment space (hyphen) inserted into St by Align0.
  Hc = alignment space (hyphen) inserted into Sc by Align0.

WHILE (St contains one or more Ht) DO
  LHt = character index of distal 5' Ht within St
  L5' = character index of Hc within Sc which is 5' proximal to Ht
  L3' = character index of Hc within Sc which is 3' proximal to Ht
  IF ((LHt - L5') > (L3' - LHt))
    Delete Hc found at L3'
  ELSE
    Delete Hc found at L5'
  Delete template gap character.
END WHILE

Result: Largest MSA of full-length (>1250 nt) 16S rDNA genes.
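The gap-removal loop can be written as a short Python function. This is a sketch of the pseudocode on the slide, with a few assumptions that the slide does not state: every "-" in the template string is treated as an Align0-inserted gap (Ht), every "-" in the candidate as an Hc, "distal 5'" is read as the first template gap from the 5' end, and the function name and error handling (when no candidate gap is available on either side) are illustrative choices.

```python
def remove_template_gaps(template: str, candidate: str, gap: str = "-") -> tuple[str, str]:
    """Delete each template gap together with the nearest candidate gap.

    The template returns to its original length, so the candidate still fits
    the global MSA columns, at the cost of a local misalignment between the
    two deleted positions.
    """
    if len(template) != len(candidate):
        raise ValueError("aligned sequences must have equal length")
    st, sc = list(template), list(candidate)

    while gap in st:
        lht = st.index(gap)  # most 5' remaining template gap
        # Nearest candidate gap on the 5' side and on the 3' side of lht.
        l5 = next((i for i in range(lht - 1, -1, -1) if sc[i] == gap), None)
        l3 = next((i for i in range(lht + 1, len(sc)) if sc[i] == gap), None)
        if l5 is None and l3 is None:
            # Assumption: the slide does not say what happens in this case.
            raise ValueError("no candidate gap available to absorb the insertion")
        # Slide rule: delete the 3' gap only if it is strictly closer; ties go 5'.
        if l5 is None or (l3 is not None and (lht - l5) > (l3 - lht)):
            drop = l3
        else:
            drop = l5
        del sc[drop]  # delete Hc
        del st[lht]   # delete the template gap character
    return "".join(st), "".join(sc)
```

With template `ACG-TACGT` and candidate `AC-TTA-GT`, the candidate gap one column to the 5' side is closer than the one three columns to the 3' side, so the 5' gap is removed and both strings shrink to eight columns.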

greengenes.lbl.gov Name generator
NCBI annotations are non-standardized.
Determine whether the sequence is from an isolate or from an environmental amplicon/metagenome.
Concatenate useful terms.
Effort to guide future GenBank submitters toward clear record descriptions: http://www.jgi.doe.gov/16s/
[Flowchart; the yes/no branch wiring is not recoverable from the transcript.]
Input: GenBank record.
Decision nodes: Is sequence from whole genome record? “Genus species” style name in DEFINITION or source>organism? Does a source>isolate field exist? Text glob contains “clone” OR “uncultur”? Text glob contains “symbiont”? Isolate tag present? Gs result?
Intermediate steps: Glob text from “DEFINITION”, “source”, and “TITLE”. “Gs yes” / “Gs no”. “Isolate tag yes” / “Isolate tag no”. Strain tag is present.
Outcomes: Record is from an isolate. Record is from a clone. Record is from a symbiont. Record is from undecided. Record is from a isolate_str.

greengenes.lbl.gov Chimera tracking Amplicons from complex gDNA can contain partial sequence from more than one genome. Up to 4% of sequences are deemed chimeric by Bellerophon2 Flags are set to avoid using these questionable sequences in phylogeny assessments

greengenes.lbl.gov Maintain Taxonomy
JGI taxonomy organized in ARB using maximum parsimony tree insertions.
Example: http://greengenes.lbl.gov/cgi-bin/User/show_one_record_v2.pl?prokMSA_id=82172
prokMSA_id: 82172
prokMSAname: termite gut clone Rs-050
GenBank ACCESSION: AB100461.1
GenBank GI: 28971862
RDP_id: S000122947
NCBI_tax_id: 203524
Study_id: 21358
G2_chip_tax_string=Bacteria; Firmicutes; Clostridia; Clostridiales; Peptostreptococcaceae; sf_5; otu_2988
JGI_tax_string=Bacteria; Firmicutes (incl. basal lineag; Firmicutes; Peptostreptococcaceae; Mogibacterium
JGI_tax_string_format_2=Bacteria; Firmicutes (incl. basal lineag; Firmicutes; Peptostreptococcaceae; Mogibacterium; otu_415
Pace_tax_string=Bacteria; Firmicutes; Clostridium et al.; Peptostreptococcaceae; Clostridium acidiurici et al.; Clostridium difficile et al.; Clostridium aminobutyricum et
RDP_tax_string=Bacteria; Firmicutes; Clostridia; Clostridiales; unclassified_Clostridiales.
ncbi_tax_string=Bacteria; Firmicutes; Clostridia; Clostridiales; Eubacteriaceae; environmental samples
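Because each record carries several competing taxonomy strings, it can help to split them into rank lists and see where two curators stop agreeing. The following Python sketch assumes only the "name=rank; rank; ..." layout visible in the example record; the function names are hypothetical and are not part of any greengenes tool.

```python
def parse_tax_string(line: str) -> tuple[str, list[str]]:
    """Split 'name=rank1; rank2; ...' into the field name and a list of ranks."""
    name, _, value = line.partition("=")
    ranks = [part.strip() for part in value.split(";") if part.strip()]
    return name.strip(), ranks


def shared_lineage(a: list[str], b: list[str]) -> list[str]:
    """Return the leading ranks on which two taxonomies agree."""
    shared = []
    for rank_a, rank_b in zip(a, b):
        if rank_a != rank_b:
            break
        shared.append(rank_a)
    return shared


_, g2 = parse_tax_string(
    "G2_chip_tax_string=Bacteria; Firmicutes; Clostridia; Clostridiales; "
    "Peptostreptococcaceae; sf_5; otu_2988"
)
_, ncbi = parse_tax_string(
    "ncbi_tax_string=Bacteria; Firmicutes; Clostridia; Clostridiales; "
    "Eubacteriaceae; environmental samples"
)
print(shared_lineage(g2, ncbi))
```

For the example record, the G2 chip and NCBI strings agree down to the order Clostridiales and diverge at the family level.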

greengenes.lbl.gov Maintain Taxonomy

greengenes.lbl.gov Tools: BLAST; SimRank; Probe matcher; Text search; PCR primer design; Private NAST aligner.

greengenes.lbl.gov Compatible with ARB. The entire database is downloadable in ARB format. New records can be imported into a personal ARB database.

How we use greengenes data to get our work done…..

16S Sequence clustering. Each sequence is reduced to an array (list) of “probe-friendly” 25-mers which: have high complexity; can be synthesized with 75 or fewer masks; have adequate H-bond potential (G+C content over 48%, or empirical bond stability found in test arrays). Transitive clustering by fraction of 25-mers in common. Each cluster is considered an Operational Taxonomic Unit (OTU).
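A rough Python sketch of this clustering idea follows. It is illustrative only and makes several assumptions the slide does not specify: only the G+C filter is applied (the complexity and mask-count criteria are omitted), the "fraction in common" is taken as shared k-mers divided by the smaller set, the linkage threshold is a free parameter, and "transitive" is implemented as single-linkage grouping with a union-find structure.

```python
def probe_friendly_kmers(seq: str, k: int = 25, min_gc: float = 0.48) -> set[str]:
    """Collect k-mers whose G+C fraction exceeds min_gc (other filters omitted)."""
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k].upper()
        gc = (kmer.count("G") + kmer.count("C")) / k
        if gc > min_gc:
            kmers.add(kmer)
    return kmers


def shared_fraction(a: set[str], b: set[str]) -> float:
    """Assumed definition: shared k-mers relative to the smaller k-mer set."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def transitive_clusters(seqs: dict[str, str], threshold: float,
                        k: int = 25, min_gc: float = 0.48) -> list[set[str]]:
    """Single-linkage clustering: A~B and B~C place A, B, and C in one cluster."""
    names = list(seqs)
    kmers = {n: probe_friendly_kmers(seqs[n], k, min_gc) for n in names}
    parent = {n: n for n in names}

    def find(n: str) -> str:
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if shared_fraction(kmers[a], kmers[b]) >= threshold:
                parent[find(a)] = find(b)

    clusters: dict[str, set[str]] = {}
    for n in names:
        clusters.setdefault(find(n), set()).add(n)
    return list(clusters.values())
```

The all-pairs comparison is quadratic in the number of sequences, so a sketch like this only suits small inputs; the tens of thousands of sequences mentioned in the talk would need an indexed approach.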

Extended Bergey’s Taxonomy. Bergey’s v0.9 with added nomenclature from the Hugenholtz tree of environmental DNA. Each OTU is assigned to one of 455 families. Families are split into subfamilies where >15% sequence variation existed.
Results (considering both domains):
63 phyla
136 classes
262 orders
455 families
842 subfamilies (~94% identity)
8,989 OTUs (~99% identity)
30,627 sequences (each belongs to only one OTU)

Probe Design
[Figure: Example of the Location of Probes Used for the Desulfovibrio vulgaris Probe Set. Sequences shown: Desulfovibrio sp. str. DMB; Desulfovibrio sp. 'Bendigo A'; Desulfovibrio vulgaris DSM 644. Taxonomy: Bacteria; Proteobacteria; Deltaproteobacteria; Desulfovibrionales; Desulfovibrionaceae; sf_1; otu_10051. Legend: sequence discrepancies; regions not unique to OTU; regions unique to OTU.]

Locus Specific Prevalence Scoring. Example: proteobacteria OTU composed of 26 sequences. [Figure labels: 22/22; 25/25; 20/25]

Probe selection objectives for each OTU: find 11 or more 25-mers (targets) that are >90% prevalent in the OTU’s sequences; dissimilar from sequences outside the OTU; >48% G+C or empirically responsive; and located at >1 locus within the 16S rDNA gene. Presumed cross-hybridizing probes were those 25-mers that contained a central 17-mer matching sequences in more than one OTU (Urakawa, Stahl et al. 2002); this avoided probes that were unique solely due to a mismatch in one of the outer four bases. As each PM probe (Perfect Match to target) was chosen, it was paired with a control 25-mer (mismatching probe, MM), identical in all positions except the thirteenth base. The MM probe did not contain an internal 17-mer complementary to sequences in any OTU.
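The PM/MM pairing and the central 17-mer screen can be illustrated with a few lines of Python. This is a sketch under assumptions, not the group's probe-picking code: all strings are kept in target sense (no strand conversion), the MM base is assumed to be the complement of the PM base at position 13 (the slide only says the thirteenth base differs), and the function names are hypothetical.

```python
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def mismatch_probe(pm: str) -> str:
    """Build the MM control: identical to PM except at the thirteenth base.

    Assumption: the substituted base is the complement of the PM base.
    """
    if len(pm) != 25:
        raise ValueError("expected a 25-mer")
    middle = 12  # thirteenth base, 0-based index
    return pm[:middle] + _COMPLEMENT[pm[middle]] + pm[middle + 1:]


def central_17mer(probe: str) -> str:
    """Drop the outer four bases on each side of a 25-mer."""
    return probe[4:21]


def prevalence(target: str, otu_seqs: list[str]) -> float:
    """Fraction of an OTU's sequences that contain the target exactly."""
    return sum(target in seq for seq in otu_seqs) / len(otu_seqs)


def may_cross_hybridize(probe: str, other_otu_seqs: list[str]) -> bool:
    """True if the central 17-mer also occurs in a sequence outside the OTU."""
    core = central_17mer(probe)
    return any(core in seq for seq in other_otu_seqs)


pm = "TAAAGCTCTGTCTTTGGGGAAGATA"
print(mismatch_probe(pm))
print(central_17mer(pm))
```

Screening on the central 17-mer rather than the full 25-mer means a probe is rejected even when it differs from an off-target sequence only in its outer four bases, which matches the stated goal of avoiding probes that are unique solely because of an end mismatch.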

Overview of Sample Preparation: Extract genomic DNA; PCR amplify DNA; fractionate DNA; end-label with biotin; hybridize. [Figure labels: 18 µ]

Image Capture and Data Reduction. Over 500,000 data points. Scores for each of 9000 OTUs.

Distribution of 16S rDNA Sequences detected via Cloning or Microarray Analysis. Clone Hits Only (8); Clone and Array Hits (73); Array Hits Only (97).
Confirmed by specific PCR and sequencing:
Actinobacteria; Actinosynnemataceae; sf_1
Nitrospira; Nitrospiraceae; sf_1
Clostridia; Syntrophomonadaceae; sf_5
Planctomycetes; Planctomycetaceae; sf_3
Gammaproteobacteria; Pseudoalteromonadaceae; sf_1
Acidobacteria; Ellin6075/11-25; sf_1
Spirochaetes; Spirochaetaceae; sf_1
Spirochaetes; Spirochaetaceae; sf_3
Spirochaetes; Leptospiraceae; sf_3

Array is quantitative. r = 0.917
Spike-in (columns: % G+C sequence, % G+C probes):
Mycoplasma neurolyticum: 50.0, 45.4
Oenococcus oeni: 50.9, 50.8
Saprospira grandis: 51.8
Fervidobacterium nodosum: 58.2, 53.8
Caulobacter vibrioides: 56.4, 58.5

Array is quantitative. [Figure labels: ~10^11 16S gene copies; ~10^7 16S gene copies]

Example query against meteorological data: Does detection of Actinobacterium PENDANT-38 correlate with temperature?

Real-time quantitative PCR confirmation of array monitoring. Uranium Bioremediation: is uranium re-oxidation under reducing conditions due to loss of metal reducers?
(a) Array quantitation (Corrected Array Intensity)
Representative organism | Phylocode | Group | Area 2 | Reduction | Oxidation
Geothrix fermentans | 2.13.8.386 | Acidobacteriaceae | 45 | 2344 | 2290
Geobacter metallireducens | 2.28.4.7.4.10207 | Geobacteraceae | 251 | 2238 | 2188
Geobacter arculus | 2.28.4.7.4.10209 | | 38 | 1412 | 1698
(b) qPCR quantitation: species specific (Geothrix fermentans); group specific (Geobacteraceae).

Real-time quantitative PCR confirmation – Urban Aerosol Array hybridization signal correlates significantly with 16S copies in environmental aerosol DNA extract Pseudomonas oleovorans example

FEMS Letters - pseudoshift Order Class Peak Duration (sec) Phaeophyceae (phylum) Stramenopiles (no rank) 5 Basidiomycota (phylum) Fungi (kingdom) 45 Deferribacterales Cyanobacteria 450 Ascomycota (phylum) Vibrionales Gammaproteobacteria Flavobacteriales Flavobacteria Clostridiales Clostridia Rhizobiales Alphaproteobacteria Rhodospirillales n.s. Lactobacillales Bacilli Bacillales Mycoplasmatales Mollicutes Xanthomonadales Burkholderiales Betaproteobacteria Sphingomonadales Sphingobacteriales Sphingobacteria Acholeplasmatales

Acknowledgements Phil Hugenholtz – Taxonomy, Arb Interface, Chimera Niels Larson – SimRank Igor Dubosarskiy – JSP Jordan Moberg – Microarrays, Cloning Yvette Piceno – Microarrays, Primer Design Ingrid Zubieta – PCR, Cloning Eoin Brodie – Microarrays, QPCR Gary Andersen – 16S Microarray Group Leader

C. perfringens probe set identified in EPA sample 22 (N.Y. Spring)
[Figure: tree with group labels CFB, Cyan, High G+C, Bacteria, Proteo, Bacil-Strep, Gram +, Clostridium, and Clostridium subgroup labels C.AURANTIBUTYRICUM, C.THERMOBUTYRICUM_SUBGROUP, C. BUTYRICUM, C.ALGIDICARNIS, C.BOTULINUM_SUBGROUP, C.CADAVERIS, C.PERFRINGENS, C.BARATI_SUBGROUP. 16S rDNA map labels: 27, 1492, 420, 469. Probes 5 - 8.]
[Alignment: one row is shown in full, ...CGTAAAGCTCTGTCTTTGGGGAAGATAATGACGGTACCCAAGGAGGAAGCCACGGCTAACT...; the remaining rows are shown as dots (identical), except one row next to the rrnE and rrnD labels with a single T. Row labels: C. perf. resistant; C. perf. str.CPN50; Clostridium sp. AB&J; clone p-4636-2Wa2; C. perf. A; C. perf rrnA; C. perf rrnE; C. perf rrnD; C. perf rrnC; C. perf rrnB; C. perf rrnF; C. perf rrnG; C. perf str.13a; C. perf str.13b; C. perf rrnH; C. perf rrnI; C. perf rrnJ; clone OI1612; C. perf. B; Swine manure 37-3; Swine manure 37-4.]
Probe sequences:
TAAAGCTCTGTCTTTGGGGAAGATA
AAAGCTCTGTCTTTGGGGAAGATAA
AAGCTCTGTCTTTGGGGAAGATAAT
AGCTCTGTCTTTGGGGAAGATAATG
tacccaaggaggaagccacggctaa
Ave Diff = 1891
Probe Properties: 25mer exists in 90% of the taxon’s seqs. Internal 21mer exists only in one taxon.