Molecular Phylogenetics (part 1 of 2)

Computational Biology Course João André Carriço

Charles Darwin: his “tree of life” in Notebook B

Ernst Haeckel: German biologist, naturalist, philosopher, physician, professor, and artist

Many trees…how to choose one?
How to construct the tree? How to interpret the tree?

The Natural History of Middle-Earth

Phylogenetics is all about trees!
Sequoia sempervirens, Muir Woods National Monument

Molecular Phylogenetics
phylé / phylon: tribe, clan, race
genetikós: origin, source, birth

Bacteria : asexual reproduction
Bacterial genetic material:
- One circular chromosome
- Plasmids (optional)
- Transposons / insertion sequences
- Phages
Reproduction by binary fission: Generation 0, Generation 1, Generation 2

Mutation
Mutations are usually single-nucleotide changes in the genome. For instance, the DNA sequence
..ATTTGCCGGATTC…
can mutate to
..ATTTACCGGATTC… (G->A at the 5th position)
A second example (G->T at the 8th position):
….ATGGGCGGCTACTTTAC…
….ATGGGCGTCTACTTTAC…
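A minimal sketch of how the two point mutations above could be located programmatically. The function name `point_differences` is hypothetical, positions are counted from 1, and the sequences are assumed to be the same length (no indels).

```python
def point_differences(ref: str, alt: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, mutated base) for each mismatch."""
    if len(ref) != len(alt):
        raise ValueError("sequences must have equal length (no indels)")
    return [
        (i, r, a)
        for i, (r, a) in enumerate(zip(ref, alt), start=1)
        if r != a
    ]

# The two example pairs from the slide.
first = point_differences("ATTTGCCGGATTC", "ATTTACCGGATTC")
second = point_differences("ATGGGCGGCTACTTTAC", "ATGGGCGTCTACTTTAC")
print(first)   # one G->A substitution
print(second)  # one G->T substitution
```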

Indel (Insertion / Deletion)
An indel inserts or deletes one or more nucleotides instead of substituting them. For instance:
….ATGGGCGGCTACTTTAC…
….ATGGGCG_CTACTTTTAC… (one G deleted, marked "_", and one T inserted)

Recombination
….ATGGG…ACGCTCC…TTTACTTCCG…
+ ….ATGGG…CGTCTAC…TTTATCCGTA…
= ….ATGGG…ACGCTCC…TTTATCCGTA…
Recombination events occur when fragments of foreign DNA are inserted into the genome. As an example:
In the genome:
…GAGAAAAATTTGCCGGATTCTTGGCGCTAAAAGGTTTTGCGC…
“Foreign” DNA taken up by the bacterium:
TTTTCGATTTGAAAGGCTTGGTGGCCCCAAAGGTAATCC…
Parts of the two sequences are identical (ATTTG on the left and AAAGGT on the right), so the foreign DNA could recombine between these flanks, generating the following sequence:
…GAGAAAAATTTGAAAGGCTTGGTGGCCCCAAAGGTTTTGCGC…
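The flank-based replacement described above can be sketched as a simple string operation. This is only an illustration under strong assumptions: each flank is taken at its first occurrence, both flanks are kept in the product, and the helper name `recombine` is hypothetical. Real homologous recombination is far less tidy.

```python
def recombine(genome: str, donor: str, left: str, right: str) -> str:
    """Swap the genome segment between two shared flanks for the donor's segment."""
    # Locate the flanked region in the recipient genome.
    g_start = genome.index(left)
    g_end = genome.index(right, g_start + len(left)) + len(right)
    # Locate the same flanks in the donor ("foreign") DNA.
    d_start = donor.index(left)
    d_end = donor.index(right, d_start + len(left)) + len(right)
    # Keep the genome outside the flanks, take the donor from flank to flank.
    return genome[:g_start] + donor[d_start:d_end] + genome[g_end:]

genome = "GAGAAAAATTTGCCGGATTCTTGGCGCTAAAAGGTTTTGCGC"
donor = "TTTTCGATTTGAAAGGCTTGGTGGCCCCAAAGGTAATCC"
result = recombine(genome, donor, left="ATTTG", right="AAAGGT")
print(result)
```

The product keeps the genome sequence outside the flanks and the donor sequence between them, including both flanks.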

Phylogeny
[Figure: tree of bacterial strains 1 to 13, with an arrow marking a recombination event]
This tree could represent 4 generations: the first generation has the bacterial strain with ID 1, which generated strains 2 and 3 in the second generation, and so on. The arrow from strain 6 to strain 5 represents a recombination event that changed part of the genome of strain 6 into a sequence identical to that of strain 5. It is assumed to be a recombination event and not a mutation, since the odds of a mutation happening in the same place are very low.

Sexual reproduction Mendelian tree

What is represented by a phylogenetic tree?
Individuals
We start with individuals with some genotypic and phenotypic characteristics…
[Figure label: Pedigree]

Individuals
“Zooming out” from individuals and pedigrees lets us see the bigger picture of the whole population.
[Figure label: Population]

Populations can be isolated for some period of time, but on evolutionary timescales, migration of individuals occurs among different populations. Gene flow between populations has the effect of joining the different populations into a species.
[Figure labels: Population, Species]

Over long periods of time, lineages tend to split:
- Migration to a new and isolated region: founder effect
- A contiguous range can be split by geological or climatic events: vicariance
This leads to isolation of the lineages of a species, due to barriers to gene flow, and can eventually lead to speciation.
[Figure labels: Species, Time, Phylogeny]

Concepts in Phylogeny
- Terminal / taxon / leaf / external node
- Branch
- Node / internal node
- Root
[Figure: example tree with leaves bacteria, birds, marsupials, Homo]

Concepts in Phylogeny
Monophyletic group = Clade
A clade contains an ancestral lineage and all the descendants of that ancestor. It can be separated from the root with one single cut.
[Figure: example tree with leaves bacteria, birds, marsupials, Homo]

Concepts in Phylogeny
Unrooted vs Rooted Trees
[Figure: unrooted and rooted trees of taxa A, B, C]

Concepts in Phylogeny
Unrooted trees compare features of a group of organisms (example: 16S rRNA in bacteria). They illustrate relatedness without making assumptions about ancestry.
Rooted trees usually use an outlier individual or species (outgroup) to root the tree. This allows each node with descendants to represent the inferred most recent common ancestor of those descendants, and the edge lengths in some trees may be interpreted as time estimates.

Concepts in Phylogeny The information on patterns of evolutionary descent is the same regardless of the lengths of branches. If the branch length has some meaning it is usually represented. These trees depict equivalent relationships despite being different in style.

Amino acids vs mRNA codons
Original:
AUC UCA AGG GAA = Isoleucine - Serine - Arginine - Glutamic acid
Mutations:
AUC UCA AUG GAA = … - … - Methionine - …
AUC UCA AGA GAA = … - … - Arginine - …
AUC UCG AGG GAA = … - Serine - … - …
AUC UCA AGG AAA = … - … - … - Lysine
AUC UCA GGG GAA = … - … - Glycine - …
AUC UCA GGA GAA = … - … - Glycine - …
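The codon comparison above can be sketched in a few lines: some nucleotide changes leave the amino acid sequence untouched (synonymous), others change it. The codon table below is deliberately partial, covering only the codons on this slide, and the function name `translate` is hypothetical.

```python
# Partial standard genetic code: only the codons used in the slide's example.
CODON_TABLE = {
    "AUC": "Ile", "UCA": "Ser", "UCG": "Ser", "AGG": "Arg", "AGA": "Arg",
    "AUG": "Met", "GAA": "Glu", "AAA": "Lys", "GGG": "Gly", "GGA": "Gly",
}

def translate(rna: str) -> list[str]:
    """Translate a space-separated string of codons into amino acids."""
    return [CODON_TABLE[codon] for codon in rna.split()]

original = "AUC UCA AGG GAA"
variants = [
    "AUC UCA AUG GAA", "AUC UCA AGA GAA", "AUC UCG AGG GAA",
    "AUC UCA AGG AAA", "AUC UCA GGG GAA", "AUC UCA GGA GAA",
]
for variant in variants:
    same = translate(variant) == translate(original)
    kind = "synonymous" if same else "nonsynonymous"
    print(variant, "->", "-".join(translate(variant)), f"({kind})")
```

Two of the six variants (AGG to AGA, UCA to UCG) change the RNA but not the protein, which is one reason DNA and protein sequences carry different amounts of signal.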

Concepts in Phylogeny
All these trees are the same
[Figure: equivalent drawings of a tree with taxa A, B, C, D, E]

Concepts in Phylogeny
Bifurcating trees vs multifurcating trees
[Figure: two trees with taxa A, B, C, D, E]

Concepts in Phylogeny
A synapomorphy (synapomorphic character) is a derived trait shared by the members of a clade and inherited from their most recent common ancestor. Synapomorphies are used to define or confirm clades.
A homoplasy is a trait shared by two or more organisms but not present in their common ancestor. It can arise through convergent evolution or horizontal gene transfer.

Analysis of hereditary molecular differences, mainly in DNA/RNA or Protein sequences, in order to infer the evolutionary relationships of a group of organisms

DNA vs Protein: what to choose for an analysis?
It depends on the level of evolutionary relationship being investigated. When analyzing closely related individuals, DNA will be more informative. When analyzing deeper evolutionary relationships, protein sequences are preferable: they change more slowly and can therefore reveal long-term relationships.

Sequence-based phylogenetic analysis
1) Select a sequence of interest: whole gene, region of a gene (coding or non-coding), regulatory region of a gene, transposable elements or even a whole genome
2) Identify homologs: search or acquire data that are homologous to the sequence of interest
3) Align the sequences: align all the homologous regions to generate a sequence data matrix
4) Calculate the phylogeny based on the alignment

1) Selecting a sequence of interest
- It will depend on the study to be performed
- Any kind of sequence (coding vs non-coding) can be compared
- Can be more than one sequence, i.e. different loci (genes or parts of genes)
- There are always issues that can hinder the choice, and no single type of sequence is perfect for all purposes
- The decision should be made based on objective criteria, which could include convenience (easier or cheaper to clone or to sequence)

1) Selecting a sequence of interest
Example: use of small subunit ribosomal RNA (ss-rRNA) 16S for studies of microbial evolution:
- Highly conserved between species: one set of primers can be used to amplify the gene from most bacterial or archaeal species
- Can be used to study ancient evolution (ex: archaea vs. bacteria) and more recent evolution (ex: Escherichia vs. Salmonella)
- Limitation: unrelated thermophiles converge on high G+C content in rRNA, which leads to problems in the accuracy of inferred phylogenies
- Limitation: rates of rRNA evolution differ between species, and differ from those of coding genes
- Limitation: can’t discriminate well within some genera or species

2) Identifying Homologs
Homologous DNA sequences (homologs): assumed to have a shared ancestry
- Orthologs: result of a speciation event
- Paralogs: result of a gene duplication. Example: hemoglobin genes A, A2, B and F
- Xenologs: result of horizontal gene transfer
- Ohnologs: paralogs that originated by whole-genome duplication

2) Identifying Homologs
Obtaining the homologous sequences:
- Sequencing: experimental generation of data
  “Two weeks in the lab can save you two hours in the library”
- Database searching: online databases are available with deposited DNA, RNA and protein sequences
  - Query the target sequence against the database
  - Search typically uses the BLAST algorithm
  - Matches are given a score, and cut-offs are set to eliminate weak matches

2) Identifying Homologs
Problems with database searches:
- A decision must be made among the matches as to which are true homologs and which are not. Similarity of sequence is not proof of homology!
- When searching large databases with short sequences, you can get some matches by chance alone
- The sequence similarity could be due to parallel or convergent evolution (homoplasy)
- Conservative approach: set a high similarity threshold to decide if they are homologs
Homology is always an inference

3) Sequence alignment
Multiple Sequence Alignment is performed on the sequences. Remember that an alignment also represents a hypothesis!
OTU = Operational Taxonomic Unit
Each specific residue (amino acid or nucleotide base) corresponds to a state of a homologous trait. This means it is inferred that the residues in one column derive from a common residue in an ancestral sequence (positional homology).

4) Creating the phylogeny
From the alignment, for most methods, an evolutionary distance is calculated.
- Those distances take into account the redundancy of the genetic code, properties of amino acids, etc., and not only the percentage of identity between sequences
- They aim to correct the difference between the true evolutionary distance and the calculated difference in residues
- This difference arises because we sample a finite number of traits, and because DNA and protein sequences have a finite number of possible character states

Models of DNA Evolution
Jukes and Cantor (1969). (JC69)
Simplest model of DNA evolution
Assumptions:
- all substitutions are independent
- all sequence positions are equally subject to change
- equal mutation rate among the four types of nucleotides
- no insertions or deletions have occurred
p: proportion of different sites between two sequences (maximum p of 0.75!)
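As a hedged illustration, the standard JC69 correction is d = -(3/4) ln(1 - (4/3) p), where p is the proportion of differing sites. The sketch below assumes gap-free aligned sequences of equal length. The function names are hypothetical, and the example pair is the point-mutation example from earlier in the slides.

```python
import math

def p_distance(seq1: str, seq2: str) -> float:
    """Proportion of aligned sites that differ (no gaps assumed)."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    return sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)

def jc69_distance(p: float) -> float:
    """JC69-corrected distance; the logarithm is undefined once p reaches 0.75."""
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)

p = p_distance("ATTTGCCGGATTC", "ATTTACCGGATTC")  # 1 difference in 13 sites
print(p, jc69_distance(p))
```

The corrected distance is always at least as large as p, because it accounts for multiple substitutions at the same site. As p approaches 0.75 (the value expected for unrelated random sequences) the distance grows without bound.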

Kimura (1980). (K80) (Kimura 2-parameter model)
Distinguishes between transitions (more frequent) and transversions
Assumptions:
- All the bases are equally frequent
p: proportion of sites that show transitions
q: proportion of sites that show transversions

Some of the other available models (in order of increasing complexity):
- Felsenstein (1981). (F81) Extends JC69 by allowing the base frequencies to vary
- Hasegawa, Kishino and Yano (1985). (HKY85) Combines K80 and F81. Also known as F84, since Felsenstein also produced an equivalent model in 1984
- Tamura (1992). (T92) Extends K80 by accounting for G+C-content bias (ex: Drosophila mitochondrial DNA)
- Tamura and Nei (1993). (TN93) Distinguishes between two types of transition (rate A<->G ≠ rate C<->T)
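A minimal sketch of the K80 distance, assuming the standard formula d = -(1/2) ln(1 - 2p - q) - (1/4) ln(1 - 2q), with p and q as defined above. The helper names are hypothetical, and the sequences are assumed to be aligned, gap-free and restricted to A, C, G, T.

```python
import math

PURINES = {"A", "G"}  # the remaining bases (C, T) are pyrimidines

def transition_transversion_props(seq1: str, seq2: str) -> tuple[float, float]:
    """Return (p, q): proportions of sites showing transitions and transversions."""
    transitions = transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine is a transition.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    n = len(seq1)
    return transitions / n, transversions / n

def k80_distance(p: float, q: float) -> float:
    """Kimura 2-parameter corrected distance."""
    a, b = 1 - 2 * p - q, 1 - 2 * q
    if a <= 0 or b <= 0:
        raise ValueError("sequences too divergent for the K80 correction")
    return -0.5 * math.log(a) - 0.25 * math.log(b)

p, q = transition_transversion_props("ATTTGCCGGATTC", "ATTTACCGGATTC")
print(p, q, k80_distance(p, q))
```

When transitions are exactly half as frequent as transversions (q = 2p), every substitution type is equally likely and the K80 distance reduces to the JC69 distance computed on p + q.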

Limitations to Jukes–Cantor and Kimura-2 parameter (and others) : Assume base composition or amino acid composition is uniform and stationary over time. When this is not the case, these methods can produce distance matrices that lead to incorrect tree inference. Other correction methods are available in those cases.

Sequence-based phylogenetic analysis
1) Select a sequence of interest: whole gene, region of a gene (coding or non-coding), regulatory region of a gene, transposable elements or even a whole genome
2) Identify homologs: search or acquire data that are homologous to the sequence of interest
3) Align the sequences: align all the homologous regions to generate a sequence data matrix
4) Correct the distance using models of DNA evolution
5) Calculate the phylogeny based on the alignment

Algorithms for tree construction
Based on the distance matrix:
- Hierarchical clustering methods: UPGMA, Single Linkage and Complete Linkage
- Neighbor-joining
- Fitch-Margoliash method
Maximum Parsimony methods
Based on rules (graphic matroids):
- goeBURST
Maximum Likelihood methods
Bayesian inference methods

Hierarchical clustering

Cluster analysis Cluster: a collection/partition of data objects (taxa) in a dataset

How many “groups”?
Cluster analysis: partitioning of a dataset into clusters. Objects in the same cluster should be “more similar” to each other than objects from different clusters.
Unsupervised classification (no a priori definition of classes/clusters)

How many “groups”?
Four steps:
1) Choosing the characters to be measured
2) Choosing the similarity/distance metric (among cases/individuals/taxa)
3) Inter-group distance/linkage calculation: UPGMA / Single Linkage / Complete Linkage / …
4) Cutting the dendrogram at certain levels to define clusters

Hierarchical clustering: WPGMA
Weighted Pair Group Method with Arithmetic Mean (Sokal and Michener)

Defining Clusters in a Dendrogram
[Figure: dendrogram with leaves B, F, D, E, A, C, cut at three different levels]
Clusters: F, DE, B, AC
Clusters: F, DE, BAC
Clusters: F, DEBAC
Which is the correct cut-off?
- Several methods have been developed based on the distances of branches to the cut
- Additional data or prior knowledge can help the decision
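To make the idea of a cut-off concrete, here is a small sketch that cuts a dendrogram at a given height. The merge heights are an assumption: they are borrowed from the A to F worked example in these slides (branch lengths 2, 2.5, 3, 4, 4.5), and the function name `cut_dendrogram` is hypothetical. Clusters are written as strings of single-letter taxon names.

```python
def cut_dendrogram(leaves: str, merges: list[tuple[str, str, float]], height: float) -> list[str]:
    """Return the clusters that remain when the tree is cut at `height`.

    `merges` lists (cluster_a, cluster_b, merge_height) in increasing height.
    """
    clusters = {frozenset([leaf]) for leaf in leaves}
    for a, b, h in merges:
        if h <= height:
            # Both subclusters exist already because merges are sorted by height.
            clusters -= {frozenset(a), frozenset(b)}
            clusters.add(frozenset(a) | frozenset(b))
    return sorted("".join(sorted(c)) for c in clusters)

# Assumed merge heights (see lead-in).
merges = [
    ("A", "C", 2.0),
    ("D", "E", 2.5),
    ("AC", "B", 3.0),
    ("ABC", "DE", 4.0),
    ("ABCDE", "F", 4.5),
]
for h in (2.75, 3.5, 4.2):
    print(h, cut_dendrogram("ABCDEF", merges, h))
```

The three cut heights give four, three and two clusters respectively, matching the three partitions listed above.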

Assume this tree, where the distances between taxa are given by the following distance matrix:

      A     B     C     D     E
B     5
C     4     7
D     7    10     7
E     6     9     6     5
F     8    11     8     9     8

Cycle 1: Shortest distance is d(AC) = 4
Join A and C in a subtree.
Branch length = d(AC)/2 = 2

Cycle 2: Distances to the new cluster (AC):
d((AC)B) = (d(AB)+d(CB))/2 = (5+7)/2 = 6
d((AC)D) = (d(AD)+d(CD))/2 = (7+7)/2 = 7
d((AC)E) = (d(AE)+d(CE))/2 = (6+6)/2 = 6
d((AC)F) = (d(AF)+d(CF))/2 = (8+8)/2 = 8

     AC     B     D     E
B     6
D     7    10
E     6     9     5
F     8    11     9     8

Shortest distance is d(DE) = 5
Join D and E in a subtree.
Branch length = d(DE)/2 = 2.5

Cycle 3: Distances to the new cluster (DE):
d((DE)B) = (d(DB)+d(EB))/2 = (10+9)/2 = 9.5
d((DE)F) = (d(DF)+d(EF))/2 = (9+8)/2 = 8.5
d((DE)(AC)) = (d(D(AC))+d(E(AC)))/2 = (7+6)/2 = 6.5

     AC     B    DE
B     6
DE  6.5   9.5
F     8    11   8.5

Shortest distance is d((AC)B) = 6
Join (AC) and B.
Branch length = d((AC)B)/2 = 3 (internal branch above AC: 3 - 2 = 1)

Cycle 4: Distances to the new cluster (ACB):
d((ACB)F) = (d((AC)F)+d(BF))/2 = (8+11)/2 = 9.5
d((ACB)(DE)) = (d((AC)(DE))+d(B(DE)))/2 = (6.5+9.5)/2 = 8

    ACB    DE
DE    8
F   9.5   8.5

Shortest distance is d((ACB)(DE)) = 8
Join (ACB) and (DE).
Branch length = d((ACB)(DE))/2 = 4 (internal branches: 4 - 3 = 1 above ACB, 4 - 2.5 = 1.5 above DE)

Cycle 5: Distance to the new cluster (ACBDE):
d((ACBDE)F) = (d((ACB)F)+d((DE)F))/2 = (9.5+8.5)/2 = 9
Shortest distance is d((ACBDE)F) = 9
Join (ACBDE) and F.
Branch length = d((ACBDE)F)/2 = 4.5 (internal branch above ACBDE: 4.5 - 4 = 0.5; branch to F: 4.5)

Final branch lengths: A 2, C 2, (AC) 1, B 3, (ACB) 1, D 2.5, E 2.5, (DE) 1.5, (ACBDE) 0.5, F 4.5
The resulting UPGMA tree differs in branch order.
Root assigned at mid-point.
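The cycles above can be replayed with a short script. This is a sketch, not a production implementation: the update rule used in the worked example (a plain average of the two merged clusters' distances, regardless of cluster size) corresponds to WPGMA, so that is what the code implements. The pairwise distances are the ones used in the calculations above, and the function name `wpgma` is hypothetical.

```python
from itertools import combinations

def wpgma(labels, pair_dist):
    """Return the merges as (cluster_a, cluster_b, branch_length) in join order.

    pair_dist maps frozenset({x, y}) to the distance between taxa x and y.
    """
    d = dict(pair_dist)
    clusters = list(labels)
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        merges.append((a, b, d[frozenset((a, b))] / 2))
        merged = a + b
        rest = [c for c in clusters if c not in (a, b)]
        # WPGMA update: simple mean of the two old distances.
        for c in rest:
            d[frozenset((merged, c))] = (d[frozenset((a, c))] + d[frozenset((b, c))]) / 2
        clusters = rest + [merged]
    return merges

raw = {"AB": 5, "AC": 4, "BC": 7, "AD": 7, "BD": 10, "CD": 7, "AE": 6, "BE": 9,
       "CE": 6, "DE": 5, "AF": 8, "BF": 11, "CF": 8, "DF": 9, "EF": 8}
dist = {frozenset(pair): value for pair, value in raw.items()}  # "AB" -> {A, B}

merges = wpgma("ABCDEF", dist)
for a, b, length in merges:
    print(f"join {a} + {b}, branch length {length}")
```

The script joins A with C, D with E, then B with (AC), (DE) with (ACB), and finally F, with branch lengths 2, 2.5, 3, 4 and 4.5, as in the five cycles above.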

Hierarchical clustering
Advantages of hierarchical clustering methods:
- Speed (O(n²))
Limitations of UPGMA/WPGMA/Single Linkage/Complete Linkage:
- They assume that the rates of evolutionary change are uniform across evolutionary branches (same molecular clock in different branches)
- They assume an ultrametric tree, i.e., for any 3 taxa {A, B, C}, d(A,C) ≤ max(d(A,B), d(B,C))
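The ultrametric condition can be checked directly on a distance matrix. The sketch below tests the three-point inequality for every ordered triple of taxa. The function name `is_ultrametric` and the two tiny example matrices are illustrative assumptions. The second matrix uses the A, B, C distances from the worked example in these slides.

```python
from itertools import permutations

def is_ultrametric(labels, d, tol=1e-9):
    """True if d(a,c) <= max(d(a,b), d(b,c)) holds for every triple of taxa."""
    return all(
        d[frozenset((a, c))] <= max(d[frozenset((a, b))], d[frozenset((b, c))]) + tol
        for a, b, c in permutations(labels, 3)
    )

# Distances read off a clock-like tree: the two largest values in the triple are equal.
clock_like = {frozenset("AB"): 6, frozenset("AC"): 4, frozenset("BC"): 6}
# Observed distances for A, B, C in the worked example: d(BC) = 7 exceeds max(5, 4).
observed = {frozenset("AB"): 5, frozenset("AC"): 4, frozenset("BC"): 7}

print(is_ultrametric("ABC", clock_like))
print(is_ultrametric("ABC", observed))
```

When the observed distances fail this test, a clock-based clustering method can still build a tree, but the tree distances will not match the input distances.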

Hierarchical clustering: Single and Complete Linkage
Given two groups g1 and g2:
- Complete linkage: the distance between groups is the distance between the farthest pair. Pulls groups farther apart.
- Single linkage: the distance between groups is the distance between the closest pair. Pulls groups closer together.
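A brief sketch of the linkage rules, with group average (the UPGMA rule) added for comparison. The function name `linkage_distance` is hypothetical. The example groups (A, C, B) and (D, E) and their pairwise distances are taken from the worked example in these slides.

```python
def linkage_distance(g1, g2, d, method):
    """Distance between two groups of taxa under a given linkage rule."""
    pair_distances = [d[frozenset((a, b))] for a in g1 for b in g2]
    if method == "single":      # closest pair
        return min(pair_distances)
    if method == "complete":    # farthest pair
        return max(pair_distances)
    if method == "average":     # mean over all pairs (UPGMA)
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"unknown linkage method: {method}")

raw = {"AD": 7, "AE": 6, "CD": 7, "CE": 6, "BD": 10, "BE": 9}
d = {frozenset(pair): value for pair, value in raw.items()}

for method in ("single", "complete", "average"):
    print(method, linkage_distance("ACB", "DE", d, method))
```

For these two groups the single, complete and average linkage distances are 6, 10 and 7.5. The worked example obtained 8 for the same pair of groups because it averages the two previously merged clusters without weighting by their sizes, which is the WPGMA rule.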

UPGMA/Single Linkage /Complete Linkage
[Figure: W/UPGMA (group average, centroid) compared with single linkage]

WPGMA vs UPGMA
[Figure: WPGMA and UPGMA compared; source: https://en.wikipedia.org/wiki/WPGMA]

See you tomorrow… or next at the TP
