Phylogenetics workshop: Protein sequence phylogeny week 2 Darren Soanes
Species trees Interpretation of trees Taxon sampling Tools Lateral (horizontal) gene transfer Fast evolving genes
Using DNA sequence to construct trees TGCTATT (ancestral DNA sequence) → TGCTTTT (sequence change due to mutation)
Reversals can confuse phylogenies TGCTATT (ancestral DNA sequence) → TGCTTTT (sequence change) → TGCTATT (reversal)
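A minimal sketch of why a reversal is invisible when only the ancestral and final sequences are compared. It uses the three sequences from the slide; the function name `hamming` and the variable names are illustrative, not part of the workshop material.

```python
def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x != y for x, y in zip(a, b))


ancestral = "TGCTATT"
mutated = "TGCTTTT"   # A -> T at the fifth position
reverted = "TGCTATT"  # T -> A at the same position (reversal)

# Differences visible when comparing each sequence with the ancestor
observed_after_one_change = hamming(ancestral, mutated)
observed_after_reversal = hamming(ancestral, reverted)

# Changes that actually happened along the lineage
actual_changes = hamming(ancestral, mutated) + hamming(mutated, reverted)

print(observed_after_one_change, observed_after_reversal, actual_changes)
```

Two substitutions occurred, yet the reverted sequence is identical to the ancestor, so a distance based on observed differences underestimates the amount of change.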
To minimise the effect of reversals Use DNA sequences that are evolving slowly – mutations happen rarely. Use long stretches of DNA. Align sequences, use the parts of the alignment that show a high degree of conservation. rDNA sequences (genes that encode ribosomal RNA) are often used.
Species tree constructed using ribosomal DNA (rDNA) sequence
Using protein sequences to create species trees Advantages – protein sequences evolve more slowly than DNA sequences (many DNA mutations are neutral: they do not change amino acid sequences) – reversals are less common than in DNA. Single copy protein encoding genes identified. Protein sequences joined together to create a multiple protein sequence for each species. Sequences aligned. Disadvantage – need sequenced genomes.
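The joining step can be sketched as follows: for each species, the single-copy proteins are concatenated in the same fixed order, and the joined sequences are then passed to an alignment tool. The function name, the species and protein labels, and the toy sequences are all hypothetical placeholders, not real data.

```python
def concatenate_proteins(proteins_by_species, protein_order):
    """Join each species' proteins into one sequence, always in the same order."""
    joined = {}
    for species, proteins in proteins_by_species.items():
        missing = [p for p in protein_order if p not in proteins]
        if missing:
            # A species lacking a protein would put the others out of register
            raise ValueError(f"{species} lacks proteins: {missing}")
        joined[species] = "".join(proteins[p] for p in protein_order)
    return joined


# Toy input (made-up sequences, for illustration only)
toy_proteins = {
    "species_A": {"prot1": "MKTLLV", "prot2": "MSDQ"},
    "species_B": {"prot1": "MKSLLV", "prot2": "MSEQ"},
}
joined = concatenate_proteins(toy_proteins, ["prot1", "prot2"])
print(joined)
```

Keeping one fixed protein order for every species is what makes the joined sequences comparable position by position after alignment.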
Fungal species trees – more proteins = better resolution [Figure: trees built from 30 proteins and from 60 proteins; labelled groups: basidiomycetes, ascomycetes (filamentous ascomycetes, yeasts), zygomycete, microsporidia, oomycete (not fungi), plant]
Fungal Species Tree (based on 153 concatenated protein sequences)
Clades A clade consists of an ancestor organism and all its descendants.
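The definition can be checked mechanically on a rooted tree: a set of taxa forms a clade if some node has exactly those taxa as its descendants. The sketch below assumes a tree written as nested tuples with leaf names as strings; the representation, the function names and the taxon labels are illustrative assumptions.

```python
def leaves(node):
    """Return the set of leaf names below a node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def is_clade(tree, taxa):
    """True if some node of the rooted tree has exactly `taxa` as its leaves."""
    target = frozenset(taxa)
    if leaves(tree) == target:
        return True
    if isinstance(tree, str):
        return False
    return any(is_clade(child, target) for child in tree)


# Hypothetical rooted tree: (((yeast_1, yeast_2), filamentous_1), basidiomycete_1)
tree = ((("yeast_1", "yeast_2"), "filamentous_1"), "basidiomycete_1")

print(is_clade(tree, {"yeast_1", "yeast_2"}))        # the two yeasts share a node
print(is_clade(tree, {"yeast_1", "filamentous_1"}))  # leaves out yeast_2
```

The second query fails because the smallest node containing both taxa also contains `yeast_2`, so the group omits one descendant of the shared ancestor.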
Gene trees The evolutionary history of genes can be represented as phylogenetic trees based on alignment of protein sequences. Gene duplication and loss can be inferred from phylogenetic trees. Protein sequences evolve more slowly than DNA sequences (due to redundancy in the genetic code).
Gene duplication Gene duplication due to unequal crossing over during meiosis can create gene families. Sequence and function of different members of a gene family can diverge.
Sequence homology (1) Genes are said to be homologous if they share a common evolutionary ancestor. Orthologues are genes in different species that evolved from a common ancestral gene by speciation. Normally, orthologues retain the same function in the course of evolution (e.g. myoglobin in mammals).
Sequence homology (2) Paralogous genes are related by duplication within a genome. Paralogues often evolve new functions, even if these are related to the original one. In-paralogues: paralogues that were duplicated after a speciation event and are therefore in the same species. Out-paralogues: paralogues that were duplicated before a speciation event, so not necessarily in the same species.
Paralogues In-paralogues Out-paralogues A, B and C are different species; α and β are different paralogues of the same gene
Evolution of globin superfamily in human lineage
TOR gene duplication events in fungi TOR: protein kinase, subunit of a complex that regulates cell growth in response to nutrient availability and cellular stresses
Taxon sampling methods BLAST: easiest, though subjective. Occurrence of Pfam (protein family) motif. Clustering, e.g. – INPARANOID http://inparanoid.sbc.su.se/cgi-bin/index.cgi – orthoMCL http://www.orthomcl.org/cgi-bin/OrthoMclWeb.cgi
Minimum bootstrap 70% bootstrap is thought to be broadly similar to a P-value of 0.05. The minimum bootstrap used depends on the study. To improve bootstrap support – remove poorly aligned sequences if possible (these can be due to mis-annotation of genomes) – change taxon sampling
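For context, bootstrap support is conventionally obtained by resampling alignment columns with replacement, rebuilding the tree from each resampled alignment, and reporting the percentage of replicates in which a branch reappears. The sketch below shows only the resampling step; the function name and the toy alignment are hypothetical, and tree building is left to whichever phylogeny program is in use.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (one bootstrap replicate)."""
    n_cols = len(next(iter(alignment.values())))
    # The same column indices are used for every sequence, so columns stay intact
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}


# Toy alignment (made-up sequences, for illustration only)
toy_alignment = {
    "species_A": "MKTLLVAG",
    "species_B": "MKSLLVAG",
    "species_C": "MRSLIVSG",
}
replicate = bootstrap_alignment(toy_alignment, random.Random(0))
print(replicate)
```

Each replicate has the same length as the original alignment, but some columns appear more than once and others not at all.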
Collapse branches with bootstrap less than defined value
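A minimal sketch of collapsing, assuming a simple in-memory tree in which each internal node carries a bootstrap value: a weakly supported internal node is removed and its children are attached directly to its parent, producing a polytomy. The `Node` class, `collapse` and `to_newick` are illustrative names; branch lengths are ignored and the root is never collapsed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: Optional[str] = None          # leaf label
    support: Optional[int] = None       # bootstrap % on internal nodes
    children: List["Node"] = field(default_factory=list)


def collapse(node: Node, threshold: int) -> Node:
    """Replace internal nodes with support below threshold by their children."""
    new_children = []
    for child in node.children:
        collapse(child, threshold)  # work from the tips towards the root
        is_internal = bool(child.children)
        if is_internal and child.support is not None and child.support < threshold:
            new_children.extend(child.children)  # promote grandchildren
        else:
            new_children.append(child)
    node.children = new_children
    return node


def to_newick(node: Node) -> str:
    """Write the topology with support values as internal node labels."""
    if not node.children:
        return node.name
    inner = ",".join(to_newick(c) for c in node.children)
    label = "" if node.support is None else str(node.support)
    return f"({inner}){label}"


# Hypothetical tree: (A,B) has 40% support, ((A,B),C) has 95%
root = Node(children=[
    Node(support=95, children=[
        Node(support=40, children=[Node("A"), Node("B")]),
        Node("C"),
    ]),
    Node("D"),
])
before = to_newick(root)
after = to_newick(collapse(root, threshold=70))
print(before, "->", after)
```

With a 70% cut-off the (A,B) grouping is dissolved, so A, B and C are shown as an unresolved three-way split under the 95% node.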
Genes that evolve quickly (1) Synonymous substitution – change in DNA sequence that does not affect the amino acid sequence, often in the third position of a codon, e.g. CCG (Pro)→CCA (Pro). Non-synonymous substitution - change in DNA sequence that does affect the amino acid sequence, often in the first or second position of a codon, e.g. CCG (Pro)→CAG (Gln).
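The two examples on this slide can be checked with a short sketch that translates each codon using the standard genetic code and compares the amino acids. The function name `classify_substitution` is illustrative; the table is built from the standard code listed in TCAG order.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(before: str, after: str) -> str:
    """Label a codon change by whether the encoded amino acid changes."""
    aa_before = CODON_TABLE[before.upper()]
    aa_after = CODON_TABLE[after.upper()]
    return "synonymous" if aa_before == aa_after else "non-synonymous"


print(classify_substitution("CCG", "CCA"))  # Pro -> Pro, third position
print(classify_substitution("CCG", "CAG"))  # Pro -> Gln, second position
```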
Genes that evolve quickly (2) For a given protein encoding gene (comparison between orthologues in more than one species): dN = number of non-synonymous substitutions per non-synonymous site; dS = number of synonymous substitutions per synonymous site. We can calculate the ratio dN/dS. For most genes this is < 1. For genes under evolutionary pressure to change protein sequence (diversify), dN/dS > 1.
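Reading the ratio can be sketched as below, assuming dN and dS have already been estimated by a dedicated program. The function name and the input values are hypothetical; a ratio of exactly 1 is conventionally read as neutral evolution.

```python
def interpret_dn_ds(dn: float, ds: float) -> str:
    """Classify the selective regime from pre-computed dN and dS estimates."""
    if ds <= 0:
        # The ratio is undefined without synonymous change
        raise ValueError("dS must be positive to form the ratio")
    ratio = dn / ds
    if ratio > 1:
        return "diversifying (positive) selection"
    if ratio < 1:
        return "purifying selection"
    return "neutral"


# Made-up estimates, for illustration only
conserved_gene = interpret_dn_ds(dn=0.02, ds=0.2)
fast_gene = interpret_dn_ds(dn=0.3, ds=0.1)
print(conserved_gene, "|", fast_gene)
```

In practice a gene-wide ratio only slightly above or below 1 is hard to interpret without a statistical test, so this threshold rule is a first reading rather than a conclusion.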
Genes that evolve quickly (3) CodeML (part of the PAML package) will calculate dN/dS for a set of orthologues from different (closely related) species. Human vs Chimpanzee – rapidly evolving genes involved in immunity, reproduction and olfaction (smell). Genes with very low dN/dS (under purifying selection) involved in metabolism, intracellular signalling, nerve / brain function.