Molecular Phylogenetics (part 1 of 2)


1 Molecular Phylogenetics (part 1 of 2)
Computational Biology Course João André Carriço

2 Charles Darwin ( ) Charles Darwin's “tree of life” in Notebook B,

3 Ernst Haeckel ( ) German biologist, naturalist, philosopher, physician, professor, and artist

4

5

6

7

8

9 Many trees…how to choose one?
How to construct the tree? How to interpret the tree?

10 The Natural History of Middle-Earth

11 Phylogenetics is all about trees!
Sequoia sempervirens, Muir Woods National Monument

12 Molecular Phylogenetics
phylé/phylon: tribe / clan / race
genetikós: origin / source / birth

13 Bacteria : asexual reproduction
Bacterial genetic material:
- One circular chromosome
- Plasmids (optional)
- Transposons / insertion sequences
- Phages
Reproduction by binary fission (figure labels: Generation 0, Generation 1, Generation 2)

14 Bacteria : asexual reproduction
Mutation
Mutations are usually single-nucleotide changes in the genome. For instance, the DNA sequence
…ATTTGCCGGATTC…
can mutate to
…ATTTACCGGATTC… (G->A at the 5th position)
A second example (G->T at the 8th position):
…ATGGGCGGCTACTTTAC…
…ATGGGCGTCTACTTTAC…
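A minimal sketch of how the substitutions in the two examples above could be located by comparing sequences position by position. The function name `substitutions` is an illustrative choice, and the sketch assumes both sequences have the same length (no indels).

```python
def substitutions(reference: str, mutant: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, mutant base) for every mismatch."""
    if len(reference) != len(mutant):
        raise ValueError("equal-length sequences expected; indels need an alignment first")
    return [
        (position, ref_base, mut_base)
        for position, (ref_base, mut_base) in enumerate(zip(reference, mutant), start=1)
        if ref_base != mut_base
    ]


print(substitutions("ATTTGCCGGATTC", "ATTTACCGGATTC"))          # [(5, 'G', 'A')]
print(substitutions("ATGGGCGGCTACTTTAC", "ATGGGCGTCTACTTTAC"))  # [(8, 'G', 'T')]
```

Positions are counted from 1, so the first example reports the change at the fifth base of the fragment.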

15 Bacteria : asexual reproduction
Indel (Insertion / Deletion)
An indel adds or removes nucleotides instead of substituting them. In the example below, one G is deleted (shown as _) and one T is inserted further along:
…ATGGGCGGCTACTTTAC…
…ATGGGCG_CTACTTTTAC…

16 Bacteria : asexual reproduction
Recombination
Figure: sequence 1 + sequence 2 = recombinant
…ATGGG…ACGCTCC…TTTACTTCCG…
…ATGGG…CGTCTAC…TTTATCCGTA…
…ATGGG…ACGCTCC…TTTATCCGTA…
Recombination events occur when fragments of DNA sequences manage to be inserted into the genome. As an example, take this stretch of the genome:
…GAGAAAAATTTGCCGGATTCTTGGCGCTAAAAGGTTTTGCGC…
and this “foreign” DNA that the bacterium took up:
TTTTCGATTTGAAAGGCTTGGTGGCCCCAAAGGTAATCC…
Parts of the two sequences are identical (ATTTG on the left and AAAGGT on the right), so the foreign DNA could eventually recombine between those flanks, generating the following sequence:
…GAGAAAAATTTGAAAGGCTTGGTGGCCCCAAAGGTTTTGCGC…
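The flank-based exchange described above can be sketched as a toy string operation. This is only an illustration of the idea, not a model of the biology: the function name `recombine` is hypothetical, and the sketch assumes each flank occurs in both sequences and takes the first occurrence.

```python
def recombine(genome: str, donor: str, left_flank: str, right_flank: str) -> str:
    """Replace the genome segment between two shared flanks with the donor's segment."""
    genome_start = genome.index(left_flank) + len(left_flank)
    genome_end = genome.index(right_flank, genome_start)
    donor_start = donor.index(left_flank) + len(left_flank)
    donor_end = donor.index(right_flank, donor_start)
    # Keep the genome up to and including the left flank, splice in the donor
    # segment, then continue with the genome from the right flank onwards.
    return genome[:genome_start] + donor[donor_start:donor_end] + genome[genome_end:]


genome = "GAGAAAAATTTGCCGGATTCTTGGCGCTAAAAGGTTTTGCGC"
donor = "TTTTCGATTTGAAAGGCTTGGTGGCCCCAAAGGTAATCC"
print(recombine(genome, donor, "ATTTG", "AAAGGT"))
# GAGAAAAATTTGAAAGGCTTGGTGGCCCCAAAGGTTTTGCGC
```

Both flanking matches stay intact in the output; only the stretch between them is taken from the donor.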

17 Phylogeny
[Figure: tree of strains 1 to 13, with one arrow labelled “recombination”]
This tree could represent four generations: the first generation has the bacterial strain with ID 1, which gave rise to strains 2 and 3 in the second generation, and so on. The arrow from strain 6 to strain 5 represents a recombination event that changed part of the genome of strain 6 into a sequence identical to that of strain 5. It is assumed to be a recombination event and not a mutation, since the odds of a mutation producing the same sequence in the same place are very low.

18 Sexual reproduction Mendelian tree

19 What is represented by a phylogenetic tree?
We start with individuals that have some genotypic and phenotypic characteristics. (Figure labels: Individuals, Pedigree)

20 What is represented by a phylogenetic tree?
“Zooming out” from individuals and pedigrees lets us see the bigger picture of the whole population. (Figure labels: Individuals, Population)

21 What is represented by a phylogenetic tree?
Populations can be isolated for some period of time, but on evolutionary timescales individuals migrate among different populations. This gene flow between populations has the effect of joining them into a species. (Figure labels: Population, Species)

22 What is represented by a phylogenetic tree?
Over long periods, lineages tend to split:
- Migration to a new and isolated region - Founder effect
- A contiguous range can be split by geological or climatic events - Vicariance
Either process isolates lineages of a species through barriers to gene flow, and could eventually lead to speciation. (Figure labels: Species, Time, Phylogeny)

23 Concepts in Phylogeny
Parts of a tree (figure with the leaves bacteria, birds, marsupials, Homo):
- Terminal / taxon / leaf / external node
- Branch
- Node / internal node
- Root

24 Concepts in Phylogeny Monophyletic group = Clade
A clade contains an ancestral lineage and all the descendants of that ancestor. It can be separated from the root with one single cut. (Figure leaves: bacteria, birds, marsupials, Homo)
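The clade definition above can be checked mechanically: a set of taxa is a clade if some subtree contains exactly those leaves. The sketch below represents a tree as nested tuples. The topology of `TREE` is an assumption made for illustration (the slide's figure is not recoverable here), and the helper names are hypothetical.

```python
def leaves(tree) -> set[str]:
    """Collect the leaf names of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def is_clade(tree, taxa) -> bool:
    """True if some subtree (one single cut) contains exactly the given taxa."""
    taxa = set(taxa)
    if leaves(tree) == taxa:
        return True
    if isinstance(tree, str):
        return False
    return any(is_clade(child, taxa) for child in tree)


# Assumed topology, for illustration only.
TREE = ("bacteria", ("birds", ("marsupials", "Homo")))

print(is_clade(TREE, {"marsupials", "Homo"}))   # True
print(is_clade(TREE, {"birds", "marsupials"}))  # False: Homo descends from the same ancestor
```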

25 Concepts in Phylogeny Unrooted vs Rooted Trees
[Figure: unrooted and rooted trees of taxa A, B and C]

26 Concepts in Phylogeny
Unrooted trees compare features of a group of organisms (example: 16S rRNA in bacteria). They illustrate relatedness without making assumptions about ancestry.
Rooted trees usually use an outlier individual or species (outgroup) to root the tree. Each node with descendants then represents the inferred most recent common ancestor of those descendants, and the edge lengths in some trees may be interpreted as time estimates.

27 Concepts in Phylogeny The information on patterns of evolutionary descent is the same regardless of the lengths of branches. If the branch length has some meaning it is usually represented. These trees depict equivalent relationships despite being different in style.

28 Molecular Phylogenetics
Amino acids vs RNA codons
Original:
AUC UCA AGG GAA   Isoleucine – Serine – Arginine – Glutamic acid
Mutations:
AUC UCA AUG GAA   … – … – Methionine – …
AUC UCA AGA GAA   … – … – Arginine – …
AUC UCG AGG GAA   … – Serine – … – …
AUC UCA AGG AAA   … – … – … – Lysine
AUC UCA GGG GAA   … – … – Glycine – …
AUC UCA GGA GAA   … – … – Glycine – …
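The codon examples above show why some nucleotide changes are invisible at the protein level. A minimal sketch, assuming the standard genetic code and including only the codons that appear on the slide; `translate` and `classify` are illustrative names.

```python
# Subset of the standard genetic code: only the codons used on the slide.
CODON_TABLE = {
    "AUC": "Ile", "AUG": "Met",
    "UCA": "Ser", "UCG": "Ser",
    "AGG": "Arg", "AGA": "Arg",
    "GGG": "Gly", "GGA": "Gly",
    "GAA": "Glu", "AAA": "Lys",
}


def translate(rna: str) -> list[str]:
    """Translate space-separated codons into three-letter amino acid codes."""
    return [CODON_TABLE[codon] for codon in rna.split()]


def classify(reference: str, mutant: str) -> str:
    """Label a mutant as synonymous if it encodes the same amino acids."""
    return "synonymous" if translate(reference) == translate(mutant) else "nonsynonymous"


REFERENCE = "AUC UCA AGG GAA"
for mutant in ("AUC UCA AUG GAA", "AUC UCA AGA GAA", "AUC UCG AGG GAA", "AUC UCA AGG AAA"):
    print(mutant, translate(mutant), classify(REFERENCE, mutant))
```

The AGG->AGA and UCA->UCG changes leave the protein untouched, which is one reason DNA and protein alignments carry different amounts of signal.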

29 Concepts in Phylogeny All these trees are the same
[Figure: the same tree of taxa A to E drawn in different ways]

30 Concepts in Phylogeny
Bifurcating trees vs multifurcating trees
[Figure: two trees of taxa A to E]

31 Concepts in Phylogeny
A synapomorphy (synapomorphic character) is a derived trait shared by the members of a clade; synapomorphies are used to define or confirm clades.
A homoplasy is a trait shared by two or more organisms but not present in their common ancestor. It can occur due to convergent evolution or horizontal gene transfer.

32 Molecular Phylogenetics
Analysis of hereditary molecular differences, mainly in DNA/RNA or Protein sequences, in order to infer the evolutionary relationships of a group of organisms

33 DNA vs Protein What to choose for an analysis?
It depends on the level of evolutionary relationship being investigated. When analyzing closely related individuals, DNA will be more informative. When analyzing deeper evolutionary relationships, proteins are the better choice: they change more slowly and can therefore reveal long-term relationships.

34 Sequence-based phylogenetic analysis
1) Select a sequence of interest: whole gene, region of a gene (coding or non-coding), regulatory region of a gene, transposable elements or even a whole genome
2) Identify homologs: search for or acquire data that are homologous to the sequence of interest
3) Align the sequences: align all the homologous regions to generate a sequence data matrix
4) Calculate the phylogeny based on the alignment

35 1) Selecting a sequence of interest
- It will depend on the study to be performed
- Any kind of sequence (coding vs non-coding) can be compared
- Can be more than one sequence, i.e. different loci (genes or parts of genes)
- There are always issues that can hinder the choice, and no single type of sequence is perfect for all purposes
- The decision should be made based on objective criteria, which may include convenience (easier/cheaper to clone or to sequence)

36 1) Selecting a sequence of interest
Example: use of small subunit ribosomal RNA (ss-rRNA) 16S for studies of microbial evolution:
- Highly conserved between species: one set of primers can be used to amplify the gene from most bacterial or archaeal species
- Can be used to study ancient evolution (ex: archaea vs. bacteria) and more recent evolution (ex: Escherichia vs. Salmonella)
- Limitation: unrelated thermophiles converge on high G+C content in rRNA, which leads to problems in the accuracy of inferred phylogenies
- Limitation: rRNA evolves at different rates in different species, and those rates differ from the rates of coding genes
- Limitation: can’t discriminate well within some genera or species

37 2) Identifying Homologs
Homologous DNA sequences (homologs): assumed to have a shared ancestry
- Orthologs: result of a speciation event
- Paralogs: result of a gene duplication. Example: hemoglobin genes A, A2, B and F
- Xenologs: result of horizontal gene transfer
- Ohnologs: paralogs that originated by whole genome duplication

38 2) Identifying Homologs
Obtaining the homologous sequences:
- Sequencing: experimental generation of data. “Two weeks in the lab can save you two hours in the library”
- Database searching: online databases are available with deposited DNA, RNA and protein sequences
  - Query the target sequence against the database
  - Search typically by the BLAST algorithm
  - Matches are given a score, and cut-offs are set to eliminate weak matches

39 2) Identifying Homologs
Problems with database searches:
- A decision must be made among the matches as to which are true homologs and which are not. Similarity of sequence is not proof of homology!
- When searching large databases with short sequences, you can get some matches by chance alone.
- The sequence similarity could be due to parallel or convergent evolution (homoplasy).
- Conservative approach: set a high similarity threshold to decide whether sequences are homologs.
Homology is always an inference
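The point about chance matches can be made concrete with a back-of-the-envelope estimate. The sketch below assumes a random database in which the four bases are equally frequent and independent, and it counts exact matches only; real search tools such as BLAST use more refined statistics. The function name is illustrative.

```python
def expected_chance_matches(query_length: int, database_length: int) -> float:
    """Expected number of exact matches of a query in a random DNA database.

    Assumes equal, independent base frequencies (probability 1/4 per base).
    """
    start_positions = max(database_length - query_length + 1, 0)
    return start_positions * 0.25 ** query_length


# A 10-base query against one billion bases is expected to match by chance
# roughly a thousand times; a 30-base query almost never does.
print(expected_chance_matches(10, 10**9))
print(expected_chance_matches(30, 10**9))
```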

40 3) Sequence alignment
Multiple Sequence Alignment is performed on the sequences. Remember that an alignment also represents a hypothesis! (OTU = Operational Taxonomic Unit)
Each specific residue (amino acid or nucleotide base) corresponds to a different state of a homologous trait. This means it is inferred that the residues in one column derive from a common residue in an ancestral sequence (positional homology).

41 4) Creating the phylogeny
From the alignment, for most methods, an evolutionary distance is calculated. These distances take into account the redundancy of the genetic code, the properties of amino acids, etc., and not only the percentage of identity between sequences. They aim to correct for the difference between the true evolutionary distance and the observed difference in residues. The correction is needed because we sample a finite number of traits, and because DNA and protein sequences have a finite number of possible character states, so repeated changes at the same site can go unobserved.

42 Models of DNA Evolution
Jukes and Cantor (1969) (JC69): the simplest model of DNA evolution
Assumptions:
- all substitutions are independent
- all sequence positions are equally subject to change
- equal mutation rate among the four types of nucleotides
- no insertions or deletions have occurred
p: proportion of different sites between two sequences. Max p of 0.75!
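The JC69 correction is usually written as d = -(3/4) ln(1 - (4/3) p), which also explains the 0.75 limit: the logarithm is undefined once p reaches 3/4. A minimal sketch of that standard formula follows; the function names are illustrative, and `p_distance` assumes aligned, gap-free sequences of equal length.

```python
import math


def p_distance(seq1: str, seq2: str) -> float:
    """Proportion of sites that differ between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)


def jc69_distance(p: float) -> float:
    """Jukes-Cantor corrected distance; defined only for 0 <= p < 0.75."""
    if not 0 <= p < 0.75:
        raise ValueError("JC69 is undefined for p outside [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


p = p_distance("ATTTGCCGGATTC", "ATTTACCGGATTC")  # 1 difference in 13 sites
print(p, jc69_distance(p))
```

The corrected distance is always at least as large as p, and the gap widens as sequences diverge.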

43 Models of DNA Evolution
Kimura (1980) (K80), the Kimura 2-parameter model: distinguishes between transitions (more frequent) and transversions
Assumptions:
- all the bases are equally frequent
p: proportion of sites that show transitions
q: proportion of sites that show transversions
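With p and q defined as above, the K80 distance is commonly written as d = -(1/2) ln(1 - 2p - q) - (1/4) ln(1 - 2q). The sketch below counts transitions and transversions and applies that formula. It assumes aligned, gap-free, upper-case DNA containing only A, C, G and T; the function names are illustrative.

```python
import math

PURINES = {"A", "G"}  # the pyrimidines are C and T


def transition_transversion_proportions(seq1: str, seq2: str) -> tuple[float, float]:
    """Return (p, q): proportions of sites showing transitions and transversions."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        # A transition stays within purines or within pyrimidines.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    return transitions / len(seq1), transversions / len(seq1)


def k80_distance(p: float, q: float) -> float:
    """Kimura 2-parameter distance from transition (p) and transversion (q) proportions."""
    if 1 - 2 * p - q <= 0 or 1 - 2 * q <= 0:
        raise ValueError("sequences too divergent for the K80 correction")
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


p, q = transition_transversion_proportions("ATTTGCCGGATTC", "ATTTACCGGATTC")
print(p, q, k80_distance(p, q))  # one G->A transition, no transversions
```

When transversions are exactly twice as common as transitions (q = 2p), the K80 value coincides with JC69 for the same total proportion of differences.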

44 Models of DNA Evolution
Some of the other available models (in order of increasing complexity):
- Felsenstein (1981) (F81): extends JC69 by allowing the base frequencies to vary
- Hasegawa, Kishino and Yano (1985) (HKY85): combines K80 and F81. Often grouped with F84, a very similar model that Felsenstein produced in 1984
- Tamura (1992) (T92): extends K80 by accounting for G+C-content bias (ex: Drosophila mitochondrial DNA)
- Tamura and Nei (1993) (TN93): distinguishes between two types of transition (rate A<->G ≠ rate C<->T)

45 Models of DNA Evolution
Limitations of the Jukes–Cantor and Kimura 2-parameter models (and others): they assume that base composition or amino acid composition is uniform and stationary over time. When this is not the case, these methods can produce distance matrices that lead to incorrect tree inference. Other correction methods are available for those cases.

46 Sequence-based phylogenetic analysis
1) Select a sequence of interest: whole gene, region of a gene (coding or non-coding), regulatory region of a gene, transposable elements or even a whole genome
2) Identify homologs: search for or acquire data that are homologous to the sequence of interest
3) Align the sequences: align all the homologous regions to generate a sequence data matrix
4) Correct the distances using models of DNA evolution
5) Calculate the phylogeny based on the alignment

47 Algorithms for tree construction
- Based on the distance matrix:
  - Hierarchical clustering methods: UPGMA, Single Linkage and Complete Linkage
  - Neighbor-joining
  - Fitch-Margoliash method
- Maximum Parsimony methods
- Based on rules (graphic matroids): goeBURST
- Maximum Likelihood methods
- Bayesian inference methods

48 Hierarchical clustering

49 Cluster analysis Cluster: a collection/partition of data objects (taxa) in a dataset

50 How many “groups”? Cluster analysis: partitioning of a dataset into clusters. Objects in the same cluster should be “more similar” to each other than to objects from different clusters. Unsupervised classification (no a priori definition of classes/clusters)

51 How many “groups”? Four steps:
1) Choosing the characters to be measured
2) Choosing a similarity/distance metric (among cases/individuals/taxa)
3) Inter-group distance/linkage calculation: UPGMA / Single Linkage / Complete Linkage / …
4) Cutting the dendrogram at certain levels to define clusters

52 Hierarchical clustering: WPGMA
Weighted Pair Group Method with Arithmetic Mean Sokal and Michener

53 Hierarchical clustering: WPGMA

54 Defining Clusters in a Dendrogram
[Figure: dendrogram with leaves B, F, D, E, A, C, cut at three different levels]
- Clusters: F, DE, B, AC
- Clusters: F, DE, BAC
- Clusters: F, DEBAC
Which is the correct cut-off?
- Several methods have been developed based on distances of branches to the cut
- Additional data or prior knowledge can help the decision

55 Hierarchical clustering: WPGMA
Assume this tree: [figure]
The distances between taxa can be represented in the following distance matrix:

      A    B    C    D    E
B     5
C     4    7
D     7   10    7
E     6    9    6    5
F     8   11    8    9    8

Cycle 1: Shortest distance is d(AC)=4
Join A and C in a subtree: branch length = d(AC)/2 = 2

56 Hierarchical clustering: WPGMA
Distance matrix after joining A and C:

      AC    B    D    E
B      6
D      7   10
E      6    9    5
F      8   11    9    8

d((AC)B)=(d(AB)+d(CB))/2 = (5+7)/2=6
d((AC)D)=(d(AD)+d(CD))/2 = (7+7)/2=7
d((AC)E)=(d(AE)+d(CE))/2 = (6+6)/2=6
d((AC)F)=(d(AF)+d(CF))/2 = (8+8)/2=8

Cycle 2: Shortest distance is d(DE)=5
Join D and E in a subtree: branch length = d(DE)/2 = 2.5

Distance matrix after joining D and E:

      AC    B    DE
B      6
DE    6.5  9.5
F      8   11   8.5

d((DE)B)=(d(DB)+d(EB))/2 = (10+9)/2=9.5
d((DE)F)=(d(DF)+d(EF))/2 = (9+8)/2=8.5
d((DE)(AC))=(d(D(AC))+d(E(AC)))/2 = (7+6)/2=6.5

Cycle 3: Shortest distance is d((AC)B)=6
Join AC and B in a subtree: branch length = d((AC)B)/2 = 3 (the AC node sits at height 2, so the internal branch above it has length 3 - 2 = 1)

57 Hierarchical clustering: WPGMA
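Distance matrix after joining AC and B:

      ACB   DE
DE     8
F     9.5  8.5

d((ACB)F)=(d((AC)F)+d(BF))/2 = (8+11)/2=9.5
d((ACB)(DE))=(d((AC)(DE))+d(B(DE)))/2 = (6.5+9.5)/2=8

Cycle 4: Shortest distance is d((ACB)(DE))=8
Join ACB and DE in a subtree: branch length = d((ACB)(DE))/2 = 4 (internal branches: 4 - 3 = 1 above ACB and 4 - 2.5 = 1.5 above DE)

Distance matrix after joining ACB and DE:

      ACBDE
F       9

d((ACBDE)F)=(d((ACB)F)+d((DE)F))/2 = (9.5+8.5)/2=9

Cycle 5: Shortest distance is d((ACBDE)F)=9
Join ACBDE and F: branch length = d((ACBDE)F)/2 = 4.5 (internal branch above ACBDE: 4.5 - 4 = 0.5)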

58 Hierarchical clustering: WPGMA
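[Figure: final tree. Leaf branches: A = 2, C = 2, B = 3, D = 2.5, E = 2.5, F = 4.5; internal branches: 1 above AC, 1 above ACB, 1.5 above DE, 0.5 above ACBDE]
The resulting tree differs in branch order from the tree assumed at the start. The root is assigned at the mid-point.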
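The cycles worked through above follow one repeated rule: join the closest pair, place the new node at half their distance, and average the two old rows. A minimal WPGMA sketch under that reading is given below. The function name `wpgma`, the nested-tuple tree representation and the `frozenset` keys are illustrative choices, and ties are broken by list order.

```python
from itertools import combinations


def wpgma(labels, distances):
    """Cluster with WPGMA; returns (tree as nested tuples, [(merged cluster, node height)])."""
    d = dict(distances)       # frozenset({x, y}) -> distance between clusters x and y
    clusters = list(labels)
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        height = d[frozenset((a, b))] / 2
        merged = (a, b)
        rest = [c for c in clusters if c != a and c != b]
        for c in rest:
            # WPGMA: plain average of the two merged clusters' distances.
            d[frozenset((merged, c))] = (d[frozenset((a, c))] + d[frozenset((b, c))]) / 2
        clusters = rest + [merged]
        merges.append((merged, height))
    return clusters[0], merges


RAW = {"AB": 5, "AC": 4, "AD": 7, "AE": 6, "AF": 8, "BC": 7, "BD": 10, "BE": 9,
       "BF": 11, "CD": 7, "CE": 6, "CF": 8, "DE": 5, "DF": 9, "EF": 8}
DISTANCES = {frozenset(pair): float(value) for pair, value in RAW.items()}

tree, merges = wpgma("ABCDEF", DISTANCES)
print(tree)
print([height for _, height in merges])  # [2.0, 2.5, 3.0, 4.0, 4.5]
```

The node heights match the branch lengths obtained in the five cycles: 2, 2.5, 3, 4 and 4.5.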

59 Hierarchical clustering
Advantages of hierarchical clustering methods: speed (O(n²))
Limitations of UPGMA/WPGMA/Single Linkage/Complete Linkage:
- They assume that the rates of evolutionary change are uniform across evolutionary branches (the same molecular clock in every branch)
- They assume an ultrametric tree, i.e., for any 3 taxa {A, B, C}: D_AC ≤ max(D_AB, D_BC)
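The ultrametric condition can be tested directly on a distance matrix by checking the inequality for every ordered triple of taxa. A minimal sketch, with an illustrative function name and a `frozenset`-keyed dictionary as an assumed input format; the six-taxon distances are the ones used in the WPGMA worked example.

```python
from itertools import permutations


def is_ultrametric(labels, distances, tolerance=1e-9) -> bool:
    """Check D(a, c) <= max(D(a, b), D(b, c)) for every ordered triple of taxa."""
    def d(x, y):
        return distances[frozenset((x, y))]

    return all(
        d(a, c) <= max(d(a, b), d(b, c)) + tolerance
        for a, b, c in permutations(labels, 3)
    )


RAW = {"AB": 5, "AC": 4, "AD": 7, "AE": 6, "AF": 8, "BC": 7, "BD": 10, "BE": 9,
       "BF": 11, "CD": 7, "CE": 6, "CF": 8, "DE": 5, "DF": 9, "EF": 8}
EXAMPLE = {frozenset(pair): value for pair, value in RAW.items()}
CLOCKLIKE = {frozenset("XY"): 2, frozenset("XZ"): 4, frozenset("YZ"): 4}

print(is_ultrametric("ABCDEF", EXAMPLE))  # False: d(BC)=7 > max(d(AB)=5, d(AC)=4)
print(is_ultrametric("XYZ", CLOCKLIKE))   # True
```

A matrix that fails this test violates the molecular-clock assumption, so clustering methods of this family may recover the wrong branch order from it.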

60 Hierarchical clustering: Single and Complete Linkage
Given two groups g1 and g2:
- Complete linkage: the distance between groups is the distance between the farthest pair. Pulls groups farther apart
- Single linkage: the distance between groups is the distance between the closest pair. Pulls groups closer together
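Both linkage rules reduce to taking a minimum or a maximum over all cross-group pairs. A minimal sketch follows; the function names are illustrative, and the four distances are taken from the six-taxon matrix of the WPGMA example.

```python
from itertools import product


def single_linkage(g1, g2, dist) -> float:
    """Distance between the closest pair, one member from each group."""
    return min(dist(x, y) for x, y in product(g1, g2))


def complete_linkage(g1, g2, dist) -> float:
    """Distance between the farthest pair, one member from each group."""
    return max(dist(x, y) for x, y in product(g1, g2))


PAIRWISE = {frozenset("AD"): 7, frozenset("AE"): 6, frozenset("CD"): 7, frozenset("CE"): 6}


def lookup(x, y):
    return PAIRWISE[frozenset((x, y))]


print(single_linkage("AC", "DE", lookup))    # 6
print(complete_linkage("AC", "DE", lookup))  # 7
```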

61 UPGMA/Single Linkage /Complete Linkage
[Figure labels: W/UPGMA, Single Linkage, group average, centroid]

62 WPGMA vs UPGMA
[Figure panels: WPGMA, UPGMA] https://en.wikipedia.org/wiki/WPGMA
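The two methods differ only in how distances to a newly merged cluster are updated: WPGMA takes the plain average of the two merged clusters, while UPGMA weights each by the number of taxa it contains. A minimal sketch of both update rules, with illustrative function names; the numbers come from the step of the worked example where AC (two taxa) and B (one taxon) are joined and compared with DE.

```python
def wpgma_update(d_ik: float, d_jk: float) -> float:
    """WPGMA: simple average, regardless of cluster sizes."""
    return (d_ik + d_jk) / 2


def upgma_update(d_ik: float, d_jk: float, n_i: int, n_j: int) -> float:
    """UPGMA: average weighted by the sizes of the two merged clusters."""
    return (n_i * d_ik + n_j * d_jk) / (n_i + n_j)


# d((AC)(DE)) = 6.5 and d(B(DE)) = 9.5; AC holds 2 taxa, B holds 1.
print(wpgma_update(6.5, 9.5))        # 8.0
print(upgma_update(6.5, 9.5, 2, 1))  # 7.5
```

UPGMA's value equals the mean of all six original taxon-to-taxon distances between {A, B, C} and {D, E}, which is what "unweighted" refers to.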

63 See you tomorrow… or next in the TP

