
Abstract

Our current understanding of the taxonomic and phylogenetic diversity of cellular organisms, especially the bacteria and archaea, is mostly based upon studies of ribosomal RNA gene sequences, especially those for the small-subunit rRNA (ssu-rRNA). The current taxonomic classification of bacteria and archaea is also heavily based on ssu-rRNA. Despite the historical and current power of ssu-rRNA analysis, it does have some drawbacks, including copy number variation among organisms and complications introduced by horizontal gene transfer, convergent evolution, or variation in evolutionary rate. Fortunately, genome sequencing and metagenomic sequencing are providing a wealth of information about other genes in the genomes of various bacteria and archaea. By analyzing complete genome sequences in the IMG database, we have identified 40 protein-coding genes with strong potential as broad phylogenetic markers across bacteria and archaea (e.g., they are highly universal, have low variation in copy number, and have relatively congruent phylogenetic trees). We report here the development and use of methods that apply these 40 phylogenetic marker genes to operational taxonomic unit (OTU) assignment and taxonomic classification of bacteria and archaea. Our method allows one to place an organism into a specific taxonomic group at various taxonomic levels while accounting for differences in rates of evolution between taxa and between genes. We compare the OTUs and taxonomic classifications for these protein-coding marker genes with OTUs and classifications based on phylogenetic trees of ssu-rRNA and with those from sequence clustering (non-phylogenetic) methods. Our analysis demonstrates that, at the species level, phylogenetic tree-based methods examining these 40 protein-coding genes identify OTUs that are comparable to OTUs based on ssu-rRNA sequence similarity.
Our phylogenetic tree-based taxonomic classifications of IMG genomes at the genus, family, order, class, and phylum levels will be discussed.

Phylogenetic Tree Based Taxonomic Classification
Dongying Wu*1,2, Jonathan A. Eisen2
1. DOE Joint Genome Institute, Walnut Creek, California 94598, USA
2. University of California, Davis, Davis, California 95616, USA

Methods:

1. Measurement of the position of a node in a rooted phylogenetic tree

The normalized distance of a node to the leaves of its sub-tree is used to measure the position of the node in a rooted phylogenetic tree (PN, position of a node). PN is defined by equation (1):

$$PN = \frac{D_n}{R_n + D_n} \qquad (1)$$

$R_n$ is the distance from the node to the tree root, and $D_n$ is the distance from the node to the leaves of its sub-tree. $D_n$ is defined by equation (2):

$$D_n = \frac{\sum_i P_i D_i}{\sum_i P_i} \qquad (2)$$

$D_i$ is the distance between leaf $i$ and the node, and $P_i$ is the phylogenetic contribution of leaf $i$ to the sub-tree defined by the node. $D_i$ and $P_i$ are defined by equations (3) and (4):

$$D_i = L_i + \sum_m V_m \qquad (3)$$

$$P_i = L_i + \sum_m \frac{V_m}{C_m} \qquad (4)$$

$L_i$ is the length of the edge that connects leaf $i$ to its parent node, $m$ is a node between leaf $i$ and the node that equation (1) measures, $V_m$ is the length of the edge that connects node $m$ to its parent, and $C_m$ is the number of leaves in the sub-tree defined by node $m$.

Figure 1. An example of PN (position of a node) calculation.
[Figure 1 tree labels: ROOT, N2, N1, leaves A, B, C, D; edge lengths 1.0, 2.0, 3.0, 1.2, 1.8.]

$$D_A = 1.0 + 1.2 = 2.2 \qquad P_A = 1.0 + 1.2/2 = 1.6$$
$$D_B = 2.0 + 1.2 = 3.2 \qquad P_B = 2.0 + 1.2/2 = 2.6$$
$$D_C = 3.0 \qquad P_C = 3.0$$
$$D_{N2} = \frac{P_A D_A + P_B D_B + P_C D_C}{P_A + P_B + P_C} = 2.9$$
$$PN_{N2} = \frac{2.9}{1.8 + 2.9} = 0.62$$

2. Identify OTUs based on a phylogenetic tree

Position values of nodes (PN) in a tree are calculated, and an edge is cut if the PN value of its node closer to the root is larger than a cutoff (a value from 0 to 1). The leaves under such an edge define one OTU. The sequence of node and edge evaluation is illustrated in Figure 2:

(a) Mark all leaves “CURRENT”.
(b) Calculate PN values of the parent nodes of the nodes/leaves marked “CURRENT”.
(c) Is the PN value of a parent node larger than the input cutoff? If yes, cut the edge below the parent node; the leaves under the cut edge form one OTU.
(d) Remove all nodes that define no sub-trees, and change the node/leaf label from “CURRENT” to “PROCESSED”.
(e) Identify nodes with all the nodes and leaves in their sub-trees marked “PROCESSED”, mark them “CURRENT”, and return to step (b).

Figure 2. OTU (operational taxonomic unit) identification based on the PN of the nodes in a phylogenetic tree.

3. Compare two sets of OTUs

Adjusted mutual information (AMI) is used to compare two sets of OTUs. X and Y are two clusterings of OTUs. The AMI between clustering X and clustering Y is calculated by equation (1). H(X), H(Y) and H(X,Y) are the entropies of X, Y and their joint clustering, calculated by equation (2). I(X;Y) is the mutual information between X and Y, defined by equation (3). E is the average mutual information of 100 comparisons between randomized X and Y using the “permutation randomization model”.

[Equations (1) to (3) of this section are not preserved in the transcript.]

Figure 4. Comparison of IMG taxonomic annotation with OTUs generated from the IMG genome tree at different PN (position of the node) cutoffs. The IMG genome tree was built with FastTree from the concatenated alignments of 38 phylogenetic markers. Different PN cutoffs for tree-based OTU generation correspond to different levels of IMG’s current taxonomic classification.
[Figure annotation: 99,000 random samplings of the 74,789,356 pairs.]

4. Phylogenetic tree building

Peptide sequences of 40 phylogenetic marker genes were retrieved from the bacterial and archaeal genomes in the IMG database.
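The PN definition of Methods 1 and the cutting rule of Methods 2 can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation. It assumes the formulas implied by the Figure 1 arithmetic (PN = D_n / (R_n + D_n), D_n as the P_i-weighted mean of D_i, D_i = L_i + sum of V_m, P_i = L_i + sum of V_m / C_m). The `Node` class and the function names are hypothetical. PN values are computed once on the full tree rather than recomputed after each cut, which is an assumption. The example tree covers only the N2 sub-tree of Figure 1; leaf D is omitted because its edge length is not given.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical minimal tree node; `length` is the edge to the parent."""
    name: str
    length: float = 0.0
    children: list = field(default_factory=list)


def leaf_terms(node):
    """Return (leaf, D_i, P_i) for every leaf below `node`, measured to `node`."""
    terms = []
    for child in node.children:
        if not child.children:
            # Leaf attached directly to `node`: D_i = P_i = L_i.
            terms.append((child.name, child.length, child.length))
        else:
            sub = leaf_terms(child)
            c_m = len(sub)  # C_m: number of leaves under the intermediate node
            terms.extend(
                (leaf, d + child.length, p + child.length / c_m)
                for leaf, d, p in sub
            )
    return terms


def pn(node, dist_to_root):
    """Position of an internal node: D_n / (R_n + D_n)."""
    terms = leaf_terms(node)
    total_p = sum(p for _, _, p in terms)
    d_n = sum(p * d for _, d, p in terms) / total_p if total_p > 0 else 0.0
    denom = dist_to_root + d_n
    return d_n / denom if denom > 0 else 0.0


def cut_otus(root, cutoff):
    """Bottom-up pass: cut the edges below any node whose PN exceeds `cutoff`."""
    otus = []

    def visit(node, dist_to_root):
        if not node.children:
            return [node.name]
        node_pn = pn(node, dist_to_root)  # assumption: PN from the full tree
        kept = []
        for child in node.children:
            remaining = visit(child, dist_to_root + child.length)
            if not remaining:
                continue  # everything below this child is already in OTUs
            if node_pn > cutoff:
                otus.append(sorted(remaining))  # cut the edge node-child
            else:
                kept.extend(remaining)
        return kept

    leftover = visit(root, 0.0)
    if leftover:
        otus.append(sorted(leftover))
    return otus


# N2 sub-tree of Figure 1 (leaf D omitted, see the note above).
n1 = Node("N1", 1.2, [Node("A", 1.0), Node("B", 2.0)])
n2 = Node("N2", 1.8, [n1, Node("C", 3.0)])
root = Node("ROOT", 0.0, [n2])

print(round(pn(n2, 1.8), 2))   # 0.62, matching the Figure 1 example
print(cut_otus(root, 0.5))     # [['A', 'B'], ['C']]
```

In this sketch `leaf_terms` is re-evaluated for every node, which is quadratic in the worst case; caching the per-node terms would be the obvious refinement for large trees.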
The 40 genes include: ribosomal proteins S2, S10, L1, L22, L4, L2, S9, L3, L14, S5, S19, S7, L16, S13, L15, L25/L23, L6, L11, L5, S12/S23, L29, S3, S11, L10, S8, L18, S15, S17, L13 and L24; translation elongation factor EF-2; translation initiation factor IF-2; metalloendopeptidase; ffh signal recognition particle protein; phenylalanyl-tRNA synthetase beta subunit and alpha subunit; tRNA pseudouridine synthase B; porphobilinogen deaminase; phosphoribosylformylglycinamidine cyclo-ligase; and ribonuclease HII. Alignments were built with MUSCLE and phylogenetic trees were built with FastTree. Alignments of 38 markers (all except porphobilinogen deaminase and phosphoribosylformylglycinamidine cyclo-ligase) were concatenated, and a tree was built from the concatenated alignment with FastTree. Small-subunit rRNA sequences from the IMG database were aligned with the SINA server. Alignments and a RAxML tree of ssu-rRNA were retrieved from the “all-species living tree project” at the SILVA database.

Figure 3. Adjusted mutual information (AMI) between OTUs (operational taxonomic units) generated by mothur at a cutoff of 0.03 and OTUs generated from the RAxML 16S tree at different PN (position of the node) cutoffs. The distances for mothur OTU classification were based on the same alignments that the phylogenetic tree was built upon; both were retrieved from the “all-species living tree project” at the SILVA database. The PN cutoff of 0.04 defines species in this tree.
[Figure 3 axis labels: AMI compared to the mothur OTUs (cutoff 0.03); PN cutoffs for OTU identification from the SILVA RAxML tree.]

Results and Discussion

[Plot axis labels: AMI compared to IMG taxonomic grouping; PN cutoffs for OTU identification from the IMG concatenated 38-marker tree.]

Figure 5. Comparison of OTUs generated from the IMG genome tree, the IMG ssu-rRNA tree, and sequence similarity-based OTUs (mothur) at different cutoffs. The IMG genome tree yields OTUs that are comparable to those built from the ssu-rRNA tree and by mothur.
[Figure 5 legend: concatenated 38 markers; ssu-rRNA tree; ssu-rRNA mothur. Axis: AMI.]

Figure 6. Comparison of OTUs generated from the IMG genome tree and from the ribosomal protein S2, S10 and L1 trees. Our results indicate that it is feasible to compare OTUs built from phylogenetic trees of different marker genes.
[Figure 6 legend: concatenated 38 markers; ribosomal protein S2; ribosomal protein S10; ribosomal protein L1.]

Figure 7. Comparison of OTUs generated from the IMG genome tree and from trees of the flagellar protein FliL and the vitamin B12 synthesis proteins CobS and CobW. Only single-copy FliL, CobS and CobW were included in the analysis. Our study demonstrates that FliL and CobS have co-evolved with phylogenetic marker genes such as ribosomal protein-coding genes and ssu-rRNA, while the evolutionary history of CobW is less clear.
[Figure 7 legend: concatenated 38 markers; FliL; CobS; CobW. Axis: AMI.]
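The AMI comparison behind Figures 3 to 7 can be sketched in Python as follows. This is a hedged illustration rather than the authors' code. The poster's own AMI equation is not preserved, so the normalisation by max(H(X), H(Y)) is an assumption (one common form of AMI). The “permutation randomization model” is approximated by shuffling the labels of Y 100 times, which is also an assumption. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy of a clustering given as one label per item."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def adjusted_mutual_information(x, y, n_perm=100, seed=0):
    """AMI with E estimated as the mean MI over `n_perm` shuffles of y."""
    rng = random.Random(seed)
    mi = mutual_information(x, y)
    shuffled = list(y)
    total = 0.0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        total += mutual_information(x, shuffled)
    expected = total / n_perm
    # Assumed normalisation: max(H(X), H(Y)); the poster's choice is unknown.
    denom = max(entropy(x), entropy(y)) - expected
    if denom <= 0:
        return 0.0  # degenerate case (e.g. one OTU on both sides), undefined
    return (mi - expected) / denom


def labels_from_otus(otus, leaves):
    """Turn a list of OTUs (lists of leaf names) into one label per leaf."""
    otu_of = {leaf: k for k, otu in enumerate(otus) for leaf in otu}
    return [otu_of[leaf] for leaf in leaves]


leaves = list("ABCDEF")
tree_otus = [["A", "B"], ["C", "D"], ["E", "F"]]
same_otus = [["E", "F"], ["A", "B"], ["C", "D"]]      # same grouping, reordered
other_otus = [["A", "C", "E"], ["B", "D", "F"]]       # unrelated grouping

x = labels_from_otus(tree_otus, leaves)
print(adjusted_mutual_information(x, labels_from_otus(same_otus, leaves)))
print(adjusted_mutual_information(x, labels_from_otus(other_otus, leaves)))
```

Identical groupings score 1 regardless of how the OTUs are named or ordered, and groupings that share no information score at or below 0, which is what makes AMI usable for comparing OTU sets built from different trees or cutoffs.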

