MCB 372 #14: Student Presentations, Discussion, Clustering Genes Based on Phylogenetic Information J. Peter Gogarten University of Connecticut Dept. of.

Slides:



Advertisements
Similar presentations
Phylogenetic Tree A Phylogeny (Phylogenetic tree) or Evolutionary tree represents the evolutionary relationships among a set of organisms or groups of.
Advertisements

1 Orthologs: Two genes, each from a different species, that descended from a single common ancestral gene Paralogs: Two or more genes, often thought of.
ATPase dataset -> nj in figtree. ATPase dataset -> muscle -> phyml (with ASRV)– re-rooted.
Self Organization of a Massive Document Collection
No similarity vs no homology If two (complex) sequences show significant similarity in their primary sequence, they have shared ancestry, and probably.
THE EVOLUTIONARY HISTORY OF BIODIVERSITY
Phylogenetic Trees Understand the history and diversity of life. Systematics. –Study of biological diversity in evolutionary context. –Phylogeny is evolutionary.
Basics of Comparative Genomics Dr G. P. S. Raghava.
Phylogenetic reconstruction
A Web Interface to analyse SOM of Bipartitions of Gene Phylogenies - A Walk Through J. Peter Gogarten, Maria Poptsova Dept. of Molecular and Cell Biology.
Self Organizing Maps. This presentation is based on: SOM’s are invented by Teuvo Kohonen. They represent multidimensional.
Molecular Clock I. Evolutionary rate Xuhua Xia
New Tools for Visualizing Genome Evolution Lutz Hamel Dept. of Computer Science and Statistics University of Rhode Island J. Peter Gogarten Dept. of Molecular.
Molecular Evolution Revised 29/12/06
© Wiley Publishing All Rights Reserved. Phylogeny.
Trees and Sequence Space J. Peter Gogarten University of Connecticut Dept. of Molecular and Cell Biology Sculpture at Royal Botanical Gardens, Kew.
Sequence alignment: Removing ambiguous positions: Generation of pseudosamples: Calculating and evaluating phylogenies: Comparing phylogenies: Comparing.
Bioinformatics and Phylogenetic Analysis
Dispersal models Continuous populations Isolation-by-distance Discrete populations Stepping-stone Island model.
Branches, splits, bipartitions In a rooted tree: clades (for urooted trees sometimes the term clann is used) Mono-, Para-, polyphyletic groups, cladists.
Example of bipartition analysis for five genomes of photosynthetic bacteria (188 gene families) total 10 bipartitions R: Rhodobacter capsulatus, H: Heliobacillus.
Finding Orthologous Groups René van der Heijden. What is this lecture about? What is ‘orthology’? Why do we study gene-ancestry/gene-trees (phylogenies)?
Cenancestor (aka MRCA or LUCA) as placed by ancient duplicated genes (ATPases, Signal recognition particles, EF) The “Root” strictly bifurcating no reticulation.
Trees as a Tool to Visualize Evolutionary History
Cenancestor (aka LUCA or MRCA) can be placed using the echo remaining from the early expansion of the genetic code. reflects only a single cellular component.
MCB 372 #12: Tree, Quartets and Supermatrix Approaches Collaborators: Olga Zhaxybayeva (Dalhousie) Jinling Huang (ECU) Tim Harlow (UConn) Pascal Lapierre.
Branches, splits, bipartitions In a rooted tree: clades Mono-, Para-, polyphyletic groups, cladists and a natural taxonomy Terminology The term cladogram.
Trees? J. Peter Gogarten University of Connecticut Dept. of Molecular and Cell Biology Sculpture at Royal Botanical Gardens, Kew.
Tutorial 8 Clustering 1. General Methods –Unsupervised Clustering Hierarchical clustering K-means clustering Expression data –GEO –UCSC –ArrayExpress.
Classification and Phylogenies Taxonomic categories and taxa Inferring phylogenies –The similarity vs. shared derived character states –Homoplasy –Maximum.
Molecular phylogenetics
Coalescence and the Cenancestor J. Peter Gogarten University of Connecticut Department of Molecular and Cell Biology.
CZ5225: Modeling and Simulation in Biology Lecture 5: Clustering Analysis for Microarray Data III Prof. Chen Yu Zong Tel:
Pollen transcript unigene identifier log 2 -fold change Annotation (BLAST) Unigene L. longiflorum chloroplast, complete genome Unigene
ATPase dataset -> nj in figtree. ATPase dataset -> muscle -> phyml (with ASRV)– re-rooted.
Phylogenetics and Coalescence Lab 9 October 24, 2012.
3- RIBOSOMAL RNA GENE RECONSTRUCITON  Phenetics Vs. Cladistics  Homology/Homoplasy/Orthology/Paralogy  Evolution Vs. Phylogeny  The relevance of the.
Bioinformatics 2011 Molecular Evolution Revised 29/12/06.
Gene expression analysis
Chapter 26 Phylogeny and the Tree of Life
Building phylogenetic trees. Contents Phylogeny Phylogenetic trees How to make a phylogenetic tree from pairwise distances  UPGMA method (+ an example)
Phylogenetic analyses of cyanobacterial genomes: Quantification of horizontal gene transfer events Olga Zhaxybayeva, J. Peter Gogarten, Robert L. Charlebois,
Tutorial 7 Gene expression analysis 1. Expression data –GEO –UCSC –ArrayExpress General clustering methods –Unsupervised Clustering Hierarchical clustering.
TreeSOM :Cluster analysis in the self- organizing map Neural Networks 19 (2006) Special Issue Reporter 張欽隆 D
Introduction to Phylogenetic trees Colin Dewey BMI/CS 576 Fall 2015.
ATPase dataset from last Friday Alignment clustal vs muscle Conserved part are aligned reproducibly.
CUNY Graduate Center December 15 Erdal Kose. Outlines Define SOMs Application Areas Structure Of SOMs (Basic Algorithm) Learning Algorithm Simulation.
MCB 3421 class 26.
Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9.
Ayesha M.Khan Spring Phylogenetic Basics 2 One central field in biology is to infer the relation between species. Do they possess a common ancestor?
Tutorial 8 Gene expression analysis 1. How to interpret an expression matrix Expression data DBs - GEO Clustering –Hierarchical clustering –K-means clustering.
Bootstrap ? See herehere. Maximum Likelihood and Model Choice The maximum Likelihood Ratio Test (LRT) allows to compare two nested models given a dataset.Likelihood.
Phylogeny.
PHYOGENY & THE Tree of life Represent traits that are either derived or lost due to evolution.
Chapter 18 Phylogeny and the Tree of Life. Phylogeny u Phylon = tribe, geny = genesis or origin u The evolutionary history of a species or a group of.
The gradualist point of view Evolution occurs within populations where the fittest organisms have a selective advantage. Over time the advantages genes.
Taxonomy & Phylogeny. B-5.6 Summarize ways that scientists use data from a variety of sources to investigate and critically analyze aspects of evolutionary.
Phylogenetic reconstruction - How Distance analyses calculate pairwise distances (different distance measures, correction for multiple hits, correction.
Phylogeny and the Tree of Life
Big data classification using neural network
Self-Organizing Network Model (SOM) Session 11
The Coral of Life (Darwin)
Data Mining, Neural Network and Genetic Programming
Basics of Comparative Genomics
Methods of molecular phylogeny
Why could a gene tree be different from the species tree?
Gene expression analysis
Comments on bipartitions, quartets and supertrees
Self-organizing map numeric vectors and sequence motifs
Basics of Comparative Genomics
Presentation transcript:

MCB 372 #14: Student Presentations, Discussion, Clustering Genes Based on Phylogenetic Information J. Peter Gogarten University of Connecticut Dept. of Molecular and Cell Biology Edvard Munch, Dance on the shore (1900-2)

Drawing trees Treeview Tree edit NJPLOT ATV ITOL (discuss ToL ala Ciccarelli, and examples)

iToL Very Long Branch

GPX: A Tool for the Exploration and Visualization of Genome Evolution IEEE BIBE 2007 Harvard Medical School, Boston October 14-17, 2007 Neha Nahar, Lutz Hamel Department of Computer Science and Statistics University of Rhode Island Maria S. Popstova, J. Peter Gogarten Department of Molecular and Cell Biology University of Connecticut

Phylogenetics  Taxonomical classification of organisms based on how closely they are related in terms of their evolutionary differences.  Wikipedia:- (Greek: phylon = tribe, race and genetikos = relative to birth, from genesis = birth)Greek

Phylogenetic Data Number of genomes [n] Number of trees [2n-5]! / [2(n-3)(n- 3)!] Number of bipartitions [2 (n-1) -n-1] , ,075, E + 104, E E E E + 14

Analysis Of Phylogenetic Data Phylogenetic information present in genomes Break information into small quanta of information (bipartitions) Analyze spectra to detect transferred genes and plurality consensus. Dr. Peter Gogarten and Dr. Maria Popstova (UCONN)

Bipartitions  One of the methods to represent phylogenetic information.  Bipartition is division of a phylogenetic tree into two parts that are connected by a single branch.  Bootstrap support value: measure of statistical reliability Bootstrap support for the bipartitions - Bipartition 95 A B C D E

Why Bipartitions?  Number of bipartitions grow much slower than number of trees for increasing number of genomes.  Impossible computational task to iterate over all possible trees.  Bipartitions: easy and reliable.  Bipartitions can be compatible or conflicting. Compatible bipartitions: help find majority consensus. Conflicting bipartitions: related to horizontal gene transfer.

Representation Of Bipartition  Divides a dataset into two groups, but it does not consider the relationships within each of the two groups.  Non-trivial bipartitions for N genomes is equal to 2 (N-1) -N **… *..*. Number of genomes Number of bipartitions

Compatible And Conflicting Bipartitions Bipartition compatible to * *... is * * *.. Non-trivial bipartitions Bipartition conflicting to * *... is *... *

Compatibility/Incompatibilty Between Bipartitions S 1 : *. *. *.. S 2 : * * *.... S 1 :. *. *. * * S 2 :... * * * * S 1 and S 2 are compatible if: S 1 U S 2 = S 1 or S 1 U S 2 = S 2 or S 1 U S 2 = S 1 or S 1 U S 2 = S 2.. *.... * U. *..... * =. * *.... *

Data Flow- Matrix Generation Download complete N genomes Select orthologous gene families Reciprocal best BLAST hit method Gene families where one genome represented by one gene STEP 1   Gene families selected BranchClust algorithm

Data Flow- Matrix Generation For each gene family align sequences For every gene family reconstruct Maximum Likelihood (ML) tree and generate 100 bootstrap samples For each gene family parse bipartition info for each bootstrapped tree Compose Bipartition matrix STEP 2  Bipartition Matrix generated Bipartition #1 …Bipartition #k Support value vector for gene family #1 BP 11 …BP 1k Support value vector for gene family #2 BP 21 …BP 2k … ……… Support value vector for gene family #m BP m1 …BP mk

Analysis: SOM  Self organizing maps: Neural network-based algorithm attempts to detect the essential structure of the input data based on the similarity between the points in the high dimensional space. 3D vector : one dimension for each of the color components (R,G,B) SOMn iterations Sample Data

SOM Grid  Topology: Rectangular.  Data matrix: n rows * k columns  Neuron: 2 parts Data vector that is same size as the input vector. Position on the grid (x, y).  Neighborhood: Adjacent neurons - (shown in red). k-dimensional data Best matching neuron Neighborhood Neuron 2-dimensional SOM grid

SOM Algorithm - Initialize the neurons. - Repeat - For each record r in the input dataset - Find a neuron on the map that is similar to record r. - Make the neuron look more like record r. - Determine the neighboring neurons and make them look more like record r. - End For. - Until Converged. SOM Regression Equations:  - learning rate  - neighborhood distance 1. 2.

Training SOM Bipartition #1 …Bipartition #k Support value vector for gene family #1 BP 11 …BP 1k Support value vector for gene family #2 BP 21 …BP 2k … ……… Support value vector for gene family #m BP m1 …BP mk Bipartition Matrix SOM Neural Elements k-dimensional data Best matching neuron Neighborhood Neuron U-matrix representation Dark coloration  Large distance between adjacent neurons. Light coloration  Small distance Light areas  Clusters Dark areas  Cluster separators

Emergent SOM  Objective: Reduce dimensionality of data. Look for cluster of genes which favor certain tree topologies.  Emergent SOM Use large number of neurons than expected number of clusters. Visualize inter-cluster and intra-cluster relationships. No a priori knowledge of how many clusters to expect. Cluster membership is not exclusive. Visually appealing representation. T. Kohonen, Self-organizing maps, 3rd ed. Berlin ; New York: Springer, 2001.

Gene Map

Cutoff For Bootstrap Support SOM maps for bipartition matrix generated from 14 archaea species A – 0% cutoffB – 70% cutoffC – 90% cutoff

Consensus Tree ATV tree viewer displays plurality consensus for selected clusters Select neurons , 28, and 42 and visualize tree

Strongly Supported Bipartition Halobacterium, Haloarcula and Methanosarcina groups together

Consensus Tree for Strongly Supported Bipartition This bipartition is in agreement with the small subunit ribosomal RNA phylogeny and the consensus calculated from the transcription and translation machinery. SOM map for bipartition 156 where Halobacterium, Haloarcula and Methanosarcina groups together

Conflicting Bipartition Bipartition 15 corresponds to a split where Archaeoglobus groups together with Methanosarcina. This is a bipartition that is in conflict with the consensus phylogeny of conserved genes.

Tool Characteristics  Online tool that performs bipartition visualization using SOM.  Generate cluster of gene families that have close phylogenetic signal.  Interactive reconstruction of consensus tree for any combination of clusters.  Report the strongly supported and conflicting bipartitions.  Facilitate user to upload bipartition matrix file.  Store results of analysis for each user.

Future Work  Future Work Explore Locally Linear Embedding (LLE) as opposed to SOM. Explore quartets as opposed to bipartitions. Use boundless maps to avoid border effects. LLE Quartet Toroid map

GPX on the web:  Results for the analysis of 14 archaeal genomes:  Link to upload your own data:

Coalescence – the process of tracing lineages backwards in time to their common ancestors. Every two extant lineages coalesce to their most recent common ancestor. Eventually, all lineages coalesce to the cenancestor. t/2 (Kingman, 1982) Illustration is from J. Felsenstein, “Inferring Phylogenies”, Sinauer, 2003

Coalescence of ORGANISMAL and MOLECULAR Lineages 20 lineages One extinction and one speciation event per generation One horizontal transfer event once in 10 generations (I.e., speciation events) RED: organismal lineages (no HGT) BLUE: molecular lineages (with HGT) GRAY: extinct lineages 20 lineages One extinction and one speciation event per generation One horizontal transfer event once in 10 generations (I.e., speciation events) RED: organismal lineages (no HGT) BLUE: molecular lineages (with HGT) GRAY: extinct lineages RESULTS: Most recent common ancestors are different for organismal and molecular phylogenies Different coalescence times Long coalescence time for the last two lineages RESULTS: Most recent common ancestors are different for organismal and molecular phylogenies Different coalescence times Long coalescence time for the last two lineages Time

Adam and Eve never met  Albrecht Dürer, The Fall of Man, 1504 Mitochondrial Eve Y chromosome Adam Lived approximately 50,000 years ago Lived 166, ,000 years ago Thomson, R. et al. (2000) Proc Natl Acad Sci U S A 97, Underhill, P.A. et al. (2000) Nat Genet 26, Cann, R.L. et al. (1987) Nature 325, 31-6 Vigilant, L. et al. (1991) Science 253, The same is true for ancestral rRNAs, EF, SRP, ATPases!

EXTANT LINEAGES FOR THE SIMULATIONS OF 50 LINEAGES Modified from Zhaxybayeva and Gogarten (2004), TIGs 20,

The Coral of Life (Darwin)

Bacterial 16SrRNA based phylogeny (from P. D. Schloss and J. Handelsman, Microbiology and Molecular Biology Reviews, December 2004.) The deviation from the “long branches at the base” pattern could be due to under sampling an actual radiation due to an invention that was not transferred following a mass extinction