

GEBA Project Summary Dongying Wu

Phylogenetic Tree Building (Martin Wu)
Concatenate alignments of 31 marker genes
Build a PHYML tree
667 non-GEBA genomes, 53 genomes

Phylogenetic Distance (PD)
PD = sum of all the branch lengths
PD{A,B,C} = a + b + c + d
[Figure: tree with tips A, B, C and branch lengths a, b, c, d]
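
The PD definition above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the GEBA analysis: the tree layout (A and B as sisters, C joining at the root), the branch lengths and the function name `phylogenetic_distance` are all assumptions made for the example. It treats PD as the total length of the branches that connect the chosen taxa to the root.

```python
# Toy rooted tree stored as child -> (parent, branch_length).
# Topology and lengths are illustrative assumptions, not values from the slides.
TREE = {
    "A": ("AB", 1.0),     # branch a
    "B": ("AB", 2.0),     # branch b
    "AB": ("root", 0.5),  # branch d
    "C": ("root", 3.0),   # branch c
}


def phylogenetic_distance(tree, taxa):
    """Sum the lengths of all branches on the paths from the taxa to the root.

    Each branch is counted once, even when several taxa share it.
    """
    counted = set()
    total = 0.0
    for taxon in taxa:
        node = taxon
        # Walk towards the root until we reach it or hit a branch already counted.
        while node in tree and node not in counted:
            counted.add(node)
            parent, length = tree[node]
            total += length
            node = parent
    return total


print(phylogenetic_distance(TREE, ["A", "B", "C"]))  # a + b + c + d = 6.5
```

Because shared branches are counted once, adding a close relative of a taxon already in the set raises PD only by the length of its private branch.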

Phylogenetic Distance Contribution of GEBA genomes
53 random non-GEBA taxa (from a pool of 667) contribute 3.15 to the tree PD (standard deviation: 0.68 over 100 samplings).
The total tree PD is 88.8; the GEBA genomes add 11.0 to the tree.
The 26 GEBA actinobacteria add 4.29 to the total PD (actinobacteria as a whole add PD).
26 random non-GEBA actinobacteria (from a pool of 47) contribute 1.37 PD (standard deviation: 0.28 over 100 samplings).
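
The random-sampling baseline can be sketched as follows. This is one possible reading of the slide, under stated assumptions: the "contribution" of a sample is taken to be the PD lost when the sampled taxa are removed from the full tree, the tree is a toy example, and the names `pd_of` and `sampled_pd_contribution` are hypothetical.

```python
import random
import statistics

# Toy rooted tree, child -> (parent, branch_length); illustrative values only.
TOY_TREE = {
    "t1": ("n1", 1.0), "t2": ("n1", 1.0), "n1": ("root", 1.0),
    "t3": ("n2", 2.0), "t4": ("n2", 0.5), "n2": ("root", 0.25),
}


def pd_of(tree, taxa):
    """Total length of the branches connecting the taxa to the root."""
    counted, total = set(), 0.0
    for node in taxa:
        while node in tree and node not in counted:
            counted.add(node)
            node, length = tree[node][0], tree[node][1]
            total += length
    return total


def sampled_pd_contribution(tree, all_taxa, pool, k, n_samples=100, seed=0):
    """Mean and standard deviation of the PD contributed by k random taxa.

    Contribution is defined here (an assumption) as PD(all taxa) minus
    PD(all taxa without the sampled ones).
    """
    rng = random.Random(seed)
    full_pd = pd_of(tree, all_taxa)
    contributions = []
    for _ in range(n_samples):
        picked = set(rng.sample(pool, k))
        remaining = [t for t in all_taxa if t not in picked]
        contributions.append(full_pd - pd_of(tree, remaining))
    return statistics.mean(contributions), statistics.stdev(contributions)


taxa = ["t1", "t2", "t3", "t4"]
mean_gain, sd_gain = sampled_pd_contribution(TOY_TREE, taxa, taxa, k=1)
print(mean_gain, sd_gain)
```

Comparing the PD added by a chosen set of genomes with this mean and standard deviation shows whether the chosen set adds more diversity than a random pick of the same size.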

Gene Family Classification
227,562 genes from 56 genomes => 17,176,180 links
Blastp: E value cutoff 1e-10, report hits
Only blastp hits that span 80% of the lengths of both genes are kept as links
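
The link filter can be written as a small predicate. The sketch below assumes each hit is available as a dictionary with 1-based inclusive alignment coordinates and that gene lengths are known; the field names (`q_start`, `s_end`, and so on) and the function name `keep_as_link` are hypothetical, not taken from the original pipeline.

```python
def keep_as_link(hit, lengths, max_evalue=1e-10, min_span=0.8):
    """Return True if a blastp hit passes the E-value and span criteria.

    hit: dict with keys query, subject, evalue, q_start, q_end, s_start, s_end
         (coordinates are 1-based and inclusive; an assumption of this sketch).
    lengths: dict mapping gene ID -> protein length.
    """
    if hit["evalue"] > max_evalue:
        return False
    q_span = (hit["q_end"] - hit["q_start"] + 1) / lengths[hit["query"]]
    s_span = (hit["s_end"] - hit["s_start"] + 1) / lengths[hit["subject"]]
    # The alignment must cover at least 80% of BOTH genes.
    return q_span >= min_span and s_span >= min_span


lengths = {"g1": 100, "g2": 110, "g3": 300}
hits = [
    {"query": "g1", "subject": "g2", "evalue": 1e-30,
     "q_start": 5, "q_end": 95, "s_start": 10, "s_end": 105},
    {"query": "g1", "subject": "g3", "evalue": 1e-30,
     "q_start": 1, "q_end": 100, "s_start": 1, "s_end": 100},
    {"query": "g1", "subject": "g2", "evalue": 1e-5,
     "q_start": 1, "q_end": 100, "s_start": 1, "s_end": 110},
]
links = [(h["query"], h["subject"]) for h in hits if keep_as_link(h, lengths)]
print(links)  # [('g1', 'g2')]
```

Requiring the span on both genes keeps a short protein from being linked to a long multi-domain protein that merely contains a matching domain.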

MCL Clustering Algorithm
Links (matrix of sequence identities) -> Expansion -> Inflation (I=2) -> equilibrium state
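
The expansion and inflation loop can be illustrated with a compact pure-Python version of Markov clustering. This is a teaching sketch under stated assumptions, not the MCL program itself: self-loops of weight 1 are added, convergence is tested with a fixed tolerance, and clusters are read from the non-zero rows of the final matrix. The similarity values are made up.

```python
def normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def expand(m):
    """Expansion: square the matrix (two-step random walks)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, inflation):
    """Inflation: raise entries to a power, then renormalize the columns."""
    return normalize_columns([[x ** inflation for x in row] for row in m])


def mcl(similarity, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster nodes of a symmetric similarity matrix; returns sorted clusters."""
    n = len(similarity)
    # Add self-loops (assumed weight 1.0) so that no column sums to zero.
    m = normalize_columns([[similarity[i][j] + (1.0 if i == j else 0.0)
                            for j in range(n)] for i in range(n)])
    for _ in range(max_iter):
        new = inflate(expand(m), inflation)
        delta = max(abs(new[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = new
        if delta < tol:  # equilibrium state reached
            break
    clusters = set()
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Two tight groups (0-2 and 3-4) joined by one weak link between 2 and 3.
similarity = [
    [0.0, 1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]
print(mcl(similarity, inflation=2.0))
```

A higher inflation value strengthens strong links relative to weak ones at each round and so tends to produce smaller, tighter clusters.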

[Figure: Number of Families plotted against Family Size (genes/genome); size bins are given as fractions of 56, for example 2/56 - 5/56]

Evenness estimation
genome | Gene distribution ratio for family X
A | 0.316
B | 0.105
C | 0.026
D | 0
E | 0.184
F | 0.215
G | 0.158
Median dist:
Distance average = 0.087
Evenness = 100 x e^(-4 x dist)
0.031

Universality: ratio of genomes that a family appears in
Evenness: even distribution of gene family members across genomes
Size: number of members in a gene family
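
Two of these three measures follow directly from the definitions and can be sketched as below. The input layout (one genome label per family member) and the function name `family_stats` are assumptions for illustration. The sketch stops at the per-genome distribution ratios, which are the starting point of the evenness estimate, and does not compute evenness itself.

```python
from collections import Counter


def family_stats(member_genomes, n_genomes):
    """Size, universality and per-genome distribution ratios of one family.

    member_genomes: list holding the source genome of every gene in the family.
    n_genomes: total number of genomes in the study.
    """
    counts = Counter(member_genomes)
    size = len(member_genomes)               # number of members in the family
    universality = len(counts) / n_genomes   # fraction of genomes with >= 1 member
    ratios = {genome: c / size for genome, c in counts.items()}
    return size, universality, ratios


# Hypothetical family with four genes, three of them from genome "A".
size, universality, ratios = family_stats(["A", "A", "A", "B"], n_genomes=4)
print(size, universality, ratios)  # 4 0.5 {'A': 0.75, 'B': 0.25}
```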

Family size

Large families:
famID | size | function
F | (75/genome) | ABC-type transport system ATP-binding protein
F | (27/genome) | multi-sensor hybrid histidine kinase
F | (24/genome) | short chain dehydrogenase
F | (20/genome) | acyl-CoA synthetase
F | (14/genome) | serine/threonine protein kinase
F | (13/genome) | two-component system response regulator (LuxR family)
F | (13/genome) | two-component system response regulator (winged helix family)
F | (11/genome) | drug resistance transporter
F | (11/genome) | transcriptional regulator, LacI family
F | (10/genome) | two-component system sensor histidine kinase
F | (10/genome) | sugar ABC transporter, permease component

Low universality large families:
famID | size | organism number | family function | taxonomy
F | | | outer membrane protein | Bacteroidetes; Proteobacteria
F | | | outer membrane protein | Bacteroidetes
F | | | anti-sigma factor | Bacteroidetes; Proteobacteria
F | | | transcriptional regulator, AraC family | Bacteroidetes; Proteobacteria
F | | | RNA polymerase ECF-type sigma factor | Bacteroidetes (Sphingobacteriales)
F | | | DNA-binding protein | Actinobacteria (Actinobacteridae)
F | | | FtsX transmembrane transport protein | Bacteroidetes (Sphingobacteriales)
F | | | hypothetical protein | Actinobacteria (Coriobacteriaceae)

3 out of 9 largest families have very low evenness value (< 5):
short chain dehydrogenase
acyl-CoA synthetase
two-component system response regulator (LuxR)
[Figure rows: genome number, taxonomic group, organism]
56 Halobacteria Halorhabdus_utahensis
55 Halobacteria Halomicrobium_mukohataei
54 Halobacteria Halogeometricum_borinquense
53 Aminanaerobia Thermanaerovibrio_acidaminovorans
52 Deferribacteres Dethiosulfovibrio_peptidovorans
51 Deinococci Meiothermus_silvanus
50 Deinococci Meiothermus_ruber
49 Chloroflexi Thermobaculum_terrenum
48 Chloroflexi Sphaerobacter_thermophilus
47 Actinobacteria Conexibacter_woesei
46 Actinobacteria Atopobium_parvulum
45 Actinobacteria Slackia_heliotrinireducens
44 Actinobacteria Eggerthella_lenta
43 Actinobacteria Cryptobacterium_curtum
42 Actinobacteria Acidimicrobium_ferrooxidans
41 Actinobacteria Kribbella_flavida
40 Actinobacteria Catenulispora_acidiphila
39 Actinobacteria Stackebrandtia_nassauensis
38 Actinobacteria Geodermatophilus_obscurus
37 Actinobacteria Nakamurella_multipartita
36 Actinobacteria Actinosynnema_mirum
35 Actinobacteria Saccharomonospora_viridis
34 Actinobacteria Tsukamurella_paurometabola
33 Actinobacteria Gordonia_bronchialis
32 Actinobacteria Streptosporangium_roseum
31 Actinobacteria Thermobispora_bispora
30 Actinobacteria Thermomonospora_curvata
29 Actinobacteria Nocardiopsis_dassonvillei
28 Actinobacteria Kytococcus_sedentarius
27 Actinobacteria Brachybacterium_faecium
26 Actinobacteria Beutenbergia_cavernae
25 Actinobacteria Cellulomonas_flavigena
24 Actinobacteria Xylanimonas_cellulosilytica
23 Actinobacteria Jonesia_denitrificans
22 Actinobacteria Sanguibacter_keddieii
21 Firmicutes Anaerococcus_prevotii
20 Firmicutes Alicyclobacillus_acidocaldarius
19 Firmicutes Veillonella_parvula
18 Firmicutes Desulfotomaculum_acetoxidans
17 Fusobacteria Sebaldella_termitidis
16 Fusobacteria Leptotrichia_buccalis
15 Fusobacteria Streptobacillus_moniliformis
14 Spirochaetes Brachyspira_murdochii
13 Bacteroidetes Planctomyces_limnophilus
12 Bacteroidetes Rhodothermus_marinus
11 Bacteroidetes Capnocytophaga_ochracea
10 Bacteroidetes Chitinophaga_pinensis
09 Bacteroidetes Pedobacter_heparinus
08 Bacteroidetes Spirosoma_linguale
07 Bacteroidetes Dyadobacter_fermentans
06 Epsilonproteobacteria Sulfurospirillum_deleyianum
05 Deferribacteres Denitrovibrio_acetiphilus
04 Deltaproteobacteria Haliangium_ochraceum
03 Deltaproteobacteria Desulfomicrobium_baculatum
02 Deltaproteobacteria Desulfohalobium_retbaense
01 Gammaproteobacteria Kangiella_koreensis

Phylum-specific family
26/56 Actinobacteria
[Figure: axes "Gene number" and "From Actinobacteria by chance"]

712 families (size >= 10) are phylum specific
[Figure: axes "Family size" and "Organism number"]
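
A phylum-specific family can be identified with a simple scan over the clustered families. The sketch below assumes each family is a list of (gene, genome) pairs and that a genome-to-phylum lookup exists; these data structures, the toy identifiers and the function name `phylum_specific_families` are illustrative assumptions.

```python
def phylum_specific_families(families, genome_phylum, min_size=10):
    """Return families whose members all come from genomes of a single phylum.

    families: dict family ID -> list of (gene ID, genome) pairs.
    genome_phylum: dict genome -> phylum name.
    Result maps family ID -> (phylum, family size, number of organisms).
    """
    result = {}
    for fam_id, genes in families.items():
        if len(genes) < min_size:
            continue
        genomes = {genome for _, genome in genes}
        phyla = {genome_phylum[genome] for genome in genomes}
        if len(phyla) == 1:
            result[fam_id] = (phyla.pop(), len(genes), len(genomes))
    return result


# Hypothetical toy data; min_size is lowered so the example stays small.
genome_phylum = {"gA": "Actinobacteria", "gB": "Actinobacteria", "gC": "Bacteroidetes"}
families = {
    "fam1": [("x1", "gA"), ("x2", "gA"), ("x3", "gB")],
    "fam2": [("y1", "gA"), ("y2", "gC"), ("y3", "gC")],
    "fam3": [("z1", "gC")],
}
print(phylum_specific_families(families, genome_phylum, min_size=2))
# {'fam1': ('Actinobacteria', 3, 2)}
```

Reporting the number of organisms next to the family size separates families shared across a phylum from expansions inside a single genome.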

Phylum-specific families from more than two organisms
[Table: number of families per family-size bin (from 10 <= x upward) for Actinobacteria, Bacteroidetes, Deinococci, Firmicutes, Fusobacteria and Halobacteria; bin limits and counts not preserved]

The largest 6 phylum-specific families
F2699 | Bacteroidetes=303 | outer membrane protein
*F2752 | Actinobacteria=160 | RNA polymerase, sigma-24 subunit, ECF family
F2772 | Bacteroidetes=147 | putative ECF-type RNA polymerase sigma factor
F2801 | Actinobacteria=129 | DNA-binding protein
F2827 | Bacteroidetes=114 | FtsX-related transmembrane transport protein
F2867 | Actinobacteria=103 | unknown functions
* From 15 organisms

Novel gene families: none of the genes in a family has a GenBank hit (E value cutoff: 1e-5)

Streptococcus agalactiae “pan-genome”
Tettelin H. et al. PNAS 2005;102:

217,079 genes from 53 GEBA Bacterial genomes
families
N genomes => Number of families with the selected genomes
A: N from 1 to 53
B: For every N, sample the families 100 times
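
The sampling scheme (N from 1 to the number of genomes, 100 random draws per N) can be sketched as a family accumulation curve. The sketch assumes that each draw picks N genomes at random and counts the distinct families they contain; the input format and the name `family_accumulation` are hypothetical.

```python
import random
import statistics


def family_accumulation(genome_families, n_samples=100, seed=0):
    """Mean number of distinct gene families found in N randomly chosen genomes.

    genome_families: dict genome -> set of family IDs present in that genome.
    Returns a list of (N, mean family count) for N = 1 .. number of genomes.
    """
    rng = random.Random(seed)
    genomes = sorted(genome_families)
    curve = []
    for n in range(1, len(genomes) + 1):
        counts = []
        for _ in range(n_samples):
            picked = rng.sample(genomes, n)
            seen = set().union(*(genome_families[g] for g in picked))
            counts.append(len(seen))
        curve.append((n, statistics.mean(counts)))
    return curve


# Hypothetical toy input: three genomes with overlapping family content.
toy = {"g1": {1, 2}, "g2": {2, 3}, "g3": {3, 4}}
for n, mean_count in family_accumulation(toy):
    print(n, mean_count)
```

A curve that keeps rising steeply as N grows means that each new genome still contributes many families not seen before.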

[Figure: Bacteria from GEBA project. Axes: Genome Number; Gene Family Number (including families with single members). Second plot: Number of Genomes; New Genome families]

Actinobacteria: 73 genomes, including 26 GEBA genomes
Streptococcus agalactiae: 8 strains
Enterobacteriaceae: 40 genomes
  9 Escherichia coli
  7 Yersinia pestis
  6 Salmonella enterica
  3 Shigella flexneri
Bacteria: 53 GEBA genomes

[Figure: Bacteria from GEBA project. Axes: Genome Number; Gene Family Number (including families with single members)]

[Figure: axes Genome Number; Total Gene Number]

[Figure: S. agalactiae, Enterobacteriaceae, Actinobacteria, Bacteria from GEBA project. Axes: Total Gene Number; Gene Family Number]

Calculate the PD (Phylogenetic Diversity) of a sub-tree

[Figure: Bacteria from GEBA project. Axes: Genome Number; Phylogenetic Diversity]

[Figure: Bacteria from GEBA project. Axes: Phylogenetic Diversity; Gene Family Number]

How far down the road GEBA has to go in terms of PD coverage
Bacterial/Archaeal ss-rRNA from Greengenes
clusters
MCL: 99% identity at 80% span
Greengenes Bacterial/Archaeal ss-rRNA
667 Combo Bacterial ss-rRNA
50 Combo Archaeal ss-rRNA
56 GEBA ss-rRNA
Retrieve alignments from greengenes
QuickTree distance tree for all representatives
Filter out ss-rRNA from Genome Projects: 99% identity cutoffs
Filter out low-quality sequences: short sequences (<=1200 nt), low-quality sequences, duplicates, chimeras
Trim by the greengenes mask

74437 non-environmental Bacterial/Archaeal ss-rRNA from Greengenes
clusters
MCL: 99% identity at 80% span
9946 Greengenes Bacterial/Archaeal ss-rRNA
667 Combo Bacterial ss-rRNA
50 Combo Archaeal ss-rRNA
56 GEBA ss-rRNA
Retrieve alignments from greengenes
QuickTree distance tree for non-environmental representatives
Filter out ss-rRNA from Genome Projects: 99% identity cutoffs
Filter out low-quality sequences: short sequences (<=1200 nt), low-quality sequences, duplicates, chimeras
Trim by the greengenes mask

[Figure legend: GEBA, Pre-GEBA, Greengenes]
* Start from Haemophilus influenzae Rd KW20
** In each group, the taxa are sorted by their PD contributions in descending order

[Figure: GEBA genomes, pre-GEBA genomes, and organisms from the greengenes database (excluding environmental samples). Axes: Organism Numbers; Phylogenetic Diversity]

The slopes of the linear regression lines represent the PD contribution of the genomes (each window contains 50 genomes)
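
The windowed regression can be sketched with an ordinary least-squares slope. The example assumes that the input is the cumulative PD after each added genome and that the windows do not overlap; both points, the function names `ols_slope` and `window_slopes`, and the toy numbers are assumptions rather than details from the slides.

```python
def ols_slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def window_slopes(cumulative_pd, window=50):
    """Slope of cumulative PD against genome rank in consecutive windows.

    cumulative_pd[i] is the tree PD after adding the (i + 1)-th genome.
    Each slope is the average PD contributed per genome in that window.
    """
    slopes = []
    for start in range(0, len(cumulative_pd) - window + 1, window):
        xs = list(range(start + 1, start + window + 1))
        ys = cumulative_pd[start:start + window]
        slopes.append(ols_slope(xs, ys))
    return slopes


# Toy series: 4 genomes adding 2.0 PD each, then 4 genomes adding 0.5 PD each.
cumulative = [2.0, 4.0, 6.0, 8.0, 8.5, 9.0, 9.5, 10.0]
print(window_slopes(cumulative, window=4))  # [2.0, 0.5]
```

With taxa sorted by descending PD contribution, the slopes fall from window to window, which makes it easy to see where additional genomes start adding little diversity.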

Non-environmental Tree
Only the top 150 PD contributors out of 717 pre-GEBA genomes have an average PD contribution greater than the GEBA genomes.
The genome sequencing efforts have covered only 11.5% of the phylogenetic diversity to date in this study.
We can pick an additional 550 organisms and still have an average PD contribution greater than or equal to that of the 56 GEBA genomes.
To increase PD coverage to 50%, we need to sequence at least 1520 more genomes.

All-representative Tree
Current genome sequences cover only 2.2% of the PD.
We can pick an additional 4400 organisms and still have an average PD contribution greater than or equal to that of the 56 GEBA genomes.
To cover 50% of the phylogenetic diversity, we have to sequence 9218 more genomes.
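
The "how many more genomes" estimate can be illustrated as a running sum over organisms ranked by PD contribution. This sketch assumes that each organism's contribution is already known and can simply be added up in descending order, which ignores any dependence of a contribution on what was picked before it. The function name `genomes_needed` and all numbers are made up.

```python
def genomes_needed(covered_pd, total_pd, contributions, target=0.5):
    """Smallest number of additional organisms needed to reach a PD coverage target.

    covered_pd: PD already covered by sequenced genomes.
    total_pd: PD of the whole tree.
    contributions: PD added by each unsequenced organism (treated as additive).
    Returns None if the target cannot be reached with the listed organisms.
    """
    goal = target * total_pd
    running = covered_pd
    if running >= goal:
        return 0
    # Pick the largest contributors first.
    for count, gain in enumerate(sorted(contributions, reverse=True), start=1):
        running += gain
        if running >= goal:
            return count
    return None


# Toy numbers: 10% of the PD is covered and five candidate organisms remain.
print(genomes_needed(10.0, 100.0, [20.0, 5.0, 15.0, 10.0, 1.0]))  # 3
```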

rbcL

rbcL
Active sites: Catalytic, RuBP binding

Calvin cycle
Metabolites: Glycerate-3-P, P-glyceroyl-P, GAP, DHAP, Fructose-1,6-P, Fructose-6-P, Xylulose-P, Ribulose-5-P, Ribulose-1,5-P, CO2
Genes: rbcL, pgk, gap, tpiA, glpX, tktA, rpe

Calvin Cycle
Organism | phylum | entries under rpe, prk, rbcL, rbcS, pgk (in source order)
Thermomonospora_curvata_DSM_43183 | Actinobacteria | x x I x x
Meiothermus_silvanus_DSM_0994 | Deinococci | x x I,IV x x
Acidimicrobium_ferrooxidans | Actinobacteria | x x I x x
*Halogeometricum_borinquense_DSM_11551 | Halobacteria | x III x
Halomicrobium_mukohataei_DSM_12286 | Halobacteria | x III x
Alicyclobacillus_acidocaldarius_subsp | Firmicutes | x x IV x
Meiothermus_ruber_DSM_01279 | Deinococci | x x IV x
Nakamurella_multipartita_DSM_44233 | Actinobacteria | x x IV
Planctomyces_limnophilus_DSM_03776 | Bacteroidetes | x IV x
Rhodothermus_marinus_DSM_4252 | Bacteroidetes | x x IV x
Veillonella_parvula_DSM_02008 | Firmicutes | x IV x
Geodermatophilus_obscurus_DSM_43160 | Actinobacteria | x x V x
Pedobacter_heparinus_DSM_02366 | Bacteroidetes | x x V x
Dyadobacter_fermentans_DSM_18053 | Bacteroidetes | x x V x
* Finished genome