Christian M Zmasek, PhD 15 June 2010.

Presentation on theme: "Christian M Zmasek, PhD 15 June 2010."— Presentation transcript:

1 Christian M Zmasek, PhD czmasek@burnham.org 15 June 2010

2
1. Why perform phylogenetic inference?
2. Theoretical background
3. Methods
4. Software & Examples
(C) 2010 Christian M. Zmasek

3
• ‘Tree of life’: The relationships amongst different species
• Infer the functions of proteins from family members in model organisms or to refine existing annotations through phylogenetic analysis
• A method to organize/cluster sequences with biological justification

4 [Figure: two trees, each with the leaves RAT, MOUSE, HUMAN, RICE, LIZARD, SHARK, and sequences labeled X, Y, Z. Legend: query sequence; orthologous to query; most similar to query; gene duplication.]

5 [Figure: tree with the leaves RAT, WHEAT, HUMAN, BARLEY, and sequences labeled Y, Z. Legend: query sequence; orthologous to query; most similar to query; gene duplication.]

6
• A phylogeny is the evolutionary history of a species or a group of species. Lately, the term is also being applied to the evolutionary history of individual DNA or protein sequences.
• The evolutionary history of organisms or sequences can be illustrated using a tree-like diagram – a phylogenetic tree.


8
• Initially, phylogenetic trees were built based on the morphology of organisms.
• Around 1960 molecular sequences were recognized as containing phylogenetic information and hence as valuable for tree building.
• A tree built based on sequence data is called a gene tree since it is a representation of the evolutionary history of genes.
• A tree illustrating the evolutionary history of organisms is called a species tree.


11
• Homologs are defined as sequences which share a common ancestor (Fitch, 1966).
• This definition becomes unclear if mosaic proteins, which are composed of structural units originating from different genes, are considered.
• Phylogenetic trees make sense only if constructed based on homologous sequences (whole genes/proteins, or domains).

12
• Homologous sequences can be divided into orthologs, paralogs and xenologs:
  • Orthologs: diverged by a speciation event (their last common ancestor on a phylogenetic tree corresponds to a speciation event)
    • IMPORTANT: Functional similarity does not imply orthology
  • Paralogs: diverged by a duplication event (their last common ancestor corresponds to a duplication)
  • Xenologs: related to each other by horizontal gene transfer (via retroviruses, for example)
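The definitions on this slide can be sketched in code: given a gene tree whose internal nodes are labeled as speciation or duplication events, two sequences are orthologs if their last common ancestor is a speciation and paralogs if it is a duplication. The tuple representation, the function names, and the example gene names (human_G1 and so on) are illustrative assumptions, not part of the slides; xenologs are not handled.

```python
# Hypothetical representation: an internal node is (event, left, right),
# a leaf is a plain string naming one sequence.

def leaves(node):
    """Return the set of leaf names below a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)


def lca_event(node, x, y):
    """Return the event label at the last common ancestor of leaves x and y."""
    event, left, right = node
    for child in (left, right):
        # Descend while one child still contains both sequences.
        if not isinstance(child, str) and {x, y} <= leaves(child):
            return lca_event(child, x, y)
    return event


def relationship(tree, x, y):
    """Classify two distinct sequences of the tree as orthologs or paralogs."""
    labels = {"speciation": "orthologs", "duplication": "paralogs"}
    return labels[lca_event(tree, x, y)]


def copy_subtree(gene):
    # Assumed species pattern for the example: wheat splits off first,
    # then human and rat.
    return ("speciation", f"wheat_{gene}",
            ("speciation", f"human_{gene}", f"rat_{gene}"))


# An ancient duplication produced the copies G1 and G2 before the speciations.
gene_tree = ("duplication", copy_subtree("G1"), copy_subtree("G2"))

print(relationship(gene_tree, "human_G1", "rat_G1"))
print(relationship(gene_tree, "human_G1", "rat_G2"))
```

In this toy tree, every pair taken from the same copy meets at a speciation node, while every pair taken across G1 and G2 meets at the root duplication.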


14
• Orthologous sequences tend to have more similar “functions” than paralogs
• Yet: Orthologs are mathematically defined, whereas there is no definition of sequence “function” (i.e. it is a subjective term)

15
• New genes evolve if mutations accumulate while selective constraints are relaxed by gene duplication
• First recognized by Haldane (“… it [mutation pressure] will favour polyploids, and particularly allopolyploids, which possess several pairs of sets of genes, so that one gene may be altered without disadvantage…”)

16 [Figure: gene trees with the leaves Human, Rat, Wheat; node labels G1, G2, S.]

17
Multiple sequence alignment of homologous sequences
• Pairwise distance calculation
  • Algorithmic Methods Based on Pairwise Distances: UPGMA, Neighbor Joining
  • Optimality Criteria Based on Pairwise Distances: Fitch-Margoliash, Minimal Evolution
• Optimality Criteria Based on Character Data: Maximum Parsimony, Maximum Likelihood
• Bayesian Methods (MCMC)
Diagram axis labels: “Fast” and “More accurate” (in general)

18 The simplest method to measure the distance between two amino acid sequences is by their fractional dissimilarity $p$ ($n_d$ is the number of aligned sequence positions containing non-identical amino acids and $n_s$ is the number of aligned sequence positions containing identical amino acids): $p = \frac{n_d}{n_d + n_s}$
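As a minimal sketch of the fractional dissimilarity, the function below counts non-identical and identical positions in two aligned sequences and returns n_d / (n_d + n_s). The function name and the decision to skip positions containing a gap character ("-") are assumptions made for illustration; the slide does not say how gaps are treated.

```python
def fractional_dissimilarity(seq_a, seq_b):
    """Return p = n_d / (n_d + n_s) for two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    # Assumption: positions with a gap in either sequence are ignored.
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable positions")
    n_d = sum(a != b for a, b in pairs)   # non-identical positions
    n_s = len(pairs) - n_d                # identical positions
    return n_d / (n_d + n_s)


# One of six aligned positions differs.
print(fractional_dissimilarity("ARNDCQ", "VRNDCQ"))
```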

19
• Unfortunately, this is unrealistic; it does not take into account:
  • superimposed changes: multiple mutations at the same sequence location
  • different chemical properties of amino acids: for example, changing leucine into isoleucine is more likely and should be weighted less than changing leucine into proline

20
• A more realistic approach for estimating evolutionary distances is to apply maximum likelihood to empirical amino acid replacement models, such as PAM transition probability matrices.
• The likelihood $L_H$ of a hypothesis $H$ (an evolutionary distance, for example) given some data $D$ (an alignment, for example) is the probability of $D$ given $H$: $L_H = P(D \mid H)$
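To illustrate L_H = P(D|H) with H being an evolutionary distance, the sketch below estimates the distance between two aligned sequences by maximum likelihood. It deliberately swaps the PAM matrices for the much simpler Poisson model from the maximum likelihood slide, with the off-diagonal probability taken as 1/20 - (1/20)e^(-ut) so that the probabilities sum to one. The function names, the grid search, and the example sequences are assumptions for illustration only.

```python
import math


def log_likelihood(ut, n_same, n_diff):
    """Log of P(D|H) for distance parameter ut under the Poisson model.

    Factors that do not depend on ut are omitted.
    """
    decay = math.exp(-ut)
    p_same = 1 / 20 + (19 / 20) * decay   # residue unchanged
    p_diff = 1 / 20 - (1 / 20) * decay    # change to one specific other residue
    return n_same * math.log(p_same) + n_diff * math.log(p_diff)


def ml_distance(seq_a, seq_b, step=0.001, max_ut=5.0):
    """Return the ut on a grid that maximizes the likelihood of the alignment."""
    n_same = sum(a == b for a, b in zip(seq_a, seq_b))
    n_diff = len(seq_a) - n_same
    candidates = [step * k for k in range(1, int(max_ut / step) + 1)]
    return max(candidates, key=lambda ut: log_likelihood(ut, n_same, n_diff))


# Made-up pair of aligned sequences: 8 identical and 2 different positions.
print(ml_distance("ARNDCQEGHI", "ARNDCQEGLV"))
```

A grid search is used only to keep the sketch short and transparent; real programs use numerical optimizers and empirical replacement matrices.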

21
• UPGMA stands for unweighted pair group method using arithmetic averages
• It is a clustering method
• The algorithm produces rooted trees under the assumption of a molecular clock.
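A minimal UPGMA sketch, assuming a symmetric distance table stored as a dictionary of dictionaries: the two closest clusters are merged repeatedly, and the distance from a merged cluster to every other cluster is the size-weighted arithmetic average. The function name, the nested-tuple output, the internal "cluster0" style names, and the example distances are all made up for illustration.

```python
def upgma(names, dist):
    """Return (tree, root_height); the tree is a nested tuple of leaf names."""
    d = {a: dict(dist[a]) for a in names}
    size = {a: 1 for a in names}
    tree = {a: a for a in names}
    height = {a: 0.0 for a in names}
    active = list(names)
    next_id = 0
    while len(active) > 1:
        # Find the closest pair of active clusters.
        pairs = [(x, y) for i, x in enumerate(active) for y in active[i + 1:]]
        a, b = min(pairs, key=lambda p: d[p[0]][p[1]])
        u = f"cluster{next_id}"  # assumed not to clash with a taxon name
        next_id += 1
        d[u] = {}
        for c in active:
            if c not in (a, b):
                # Arithmetic average over all members of both clusters.
                avg = (size[a] * d[a][c] + size[b] * d[b][c]) / (size[a] + size[b])
                d[u][c] = d[c][u] = avg
        size[u] = size[a] + size[b]
        height[u] = d[a][b] / 2  # molecular clock: node sits halfway
        tree[u] = (tree[a], tree[b])
        active = [c for c in active if c not in (a, b)] + [u]
    root = active[0]
    return tree[root], height[root]


names = ["A", "B", "C", "D"]
matrix = [[0, 2, 6, 10],
          [2, 0, 6, 10],
          [6, 6, 0, 10],
          [10, 10, 10, 0]]
dist = {x: {y: matrix[i][j] for j, y in enumerate(names)}
        for i, x in enumerate(names)}

print(upgma(names, dist))
```

The example distances are clock-like (ultrametric), which is the situation in which UPGMA recovers the tree correctly.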

22
• As opposed to UPGMA, neighbor joining (NJ) is not misled by the absence of a molecular clock
• NJ produces phylogenetic trees (not cluster diagrams)
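A compact neighbor joining sketch, assuming the same dictionary-of-dictionaries distance table as input. At each step it joins the pair that minimizes the usual NJ criterion Q(a,b) = (n-2)·d(a,b) - r(a) - r(b), where r is the row sum, and it records branch lengths for an unrooted tree. The function name, the edge-list output, the "node0" style names, and the five-taxon example distances are illustrative assumptions.

```python
def neighbor_joining(names, dist):
    """Return a list of edges (child, parent, length) of an unrooted tree."""
    d = {a: dict(dist[a]) for a in names}
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        total = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Pick the pair with the smallest Q value.
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - total[a] - total[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        u = f"node{next_id}"  # assumed not to clash with a taxon name
        next_id += 1
        # Branch lengths from the joined pair to the new internal node.
        len_a = 0.5 * d[a][b] + (total[a] - total[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        edges.append((a, u, len_a))
        edges.append((b, u, len_b))
        # Distances from the new node to all remaining nodes.
        d[u] = {}
        for c in nodes:
            if c not in (a, b):
                d[u][c] = d[c][u] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [u]
    a, b = nodes
    edges.append((a, b, d[a][b]))
    return edges


names = ["a", "b", "c", "d", "e"]
matrix = [[0, 5, 9, 9, 8],
          [5, 0, 10, 10, 9],
          [9, 10, 0, 8, 7],
          [9, 10, 8, 0, 3],
          [8, 9, 7, 3, 0]]
dist = {x: {y: matrix[i][j] for j, y in enumerate(names)}
        for i, x in enumerate(names)}

edges = neighbor_joining(names, dist)
for child, parent, length in edges:
    print(child, parent, length)
```

Note that the branches leading to a and b come out with different lengths, which a clock-based clustering method such as UPGMA could not represent.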

23
• Fitch-Margoliash
• Minimal evolution (ME)
• Maximum Parsimony (MP)
• Maximum Likelihood (ML)

24
• Minimal evolution (ME): branch lengths are fitted to a tree according to an unweighted least squares criterion, but the optimality criterion to evaluate and compare trees is to minimize the sum of all branch lengths.

25
• Evaluate a given topology
• Example:
  Sequence1: TGC
  Sequence2: TAC
  Sequence3: AGG
  Sequence4: AAG
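This slide appears to illustrate maximum parsimony, so here is a sketch that scores the three possible topologies for the four example sequences by counting the minimum number of changes with Fitch's algorithm. The nested-tuple trees, the function names, and the choice of Fitch's algorithm are assumptions for illustration; the slide itself only says to evaluate a given topology.

```python
def fitch_site(tree, column):
    """Return (possible ancestral states, minimum changes) for one site."""
    if isinstance(tree, str):
        return {column[tree]}, 0
    left, right = tree
    left_states, left_changes = fitch_site(left, column)
    right_states, right_changes = fitch_site(right, column)
    common = left_states & right_states
    if common:
        return common, left_changes + right_changes
    # No shared state: one change is needed on this pair of branches.
    return left_states | right_states, left_changes + right_changes + 1


def parsimony_score(tree, seqs):
    """Sum the minimum number of changes over all alignment columns."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in seqs.items()})[1]
        for i in range(length)
    )


seqs = {"Sequence1": "TGC", "Sequence2": "TAC",
        "Sequence3": "AGG", "Sequence4": "AAG"}

topologies = {
    "((1,2),(3,4))": (("Sequence1", "Sequence2"), ("Sequence3", "Sequence4")),
    "((1,3),(2,4))": (("Sequence1", "Sequence3"), ("Sequence2", "Sequence4")),
    "((1,4),(2,3))": (("Sequence1", "Sequence4"), ("Sequence2", "Sequence3")),
}

for label, tree in topologies.items():
    print(label, parsimony_score(tree, seqs))
```

The topology grouping Sequence1 with Sequence2 needs the fewest changes (4, against 5 and 6 for the alternatives), so it is the most parsimonious of the three.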

26
• Probabilistic methods can be used to assign a likelihood to a given tree and therefore allow the selection of the tree which is most likely given the observed sequences.
• Probability for one residue a to change to b in time t along a branch of a tree: $P(b \mid a,t)$
• Its actual calculation depends on which model of sequence evolution is used.
• Poisson process:
  • $P(b \mid a,t) = \frac{1}{20} + \frac{19}{20} e^{-ut}$ for $a = b$
  • $P(b \mid a,t) = \frac{1}{20} - \frac{1}{20} e^{-ut}$ for $a \neq b$
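The Poisson process probabilities can be checked numerically. The sketch below uses 1/20 - (1/20)e^(-ut) for a ≠ b, which is the form required for the 20 probabilities leaving a residue to sum to one. The function name and the constant listing the 20 amino acids are illustrative assumptions.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def transition_probability(a, b, ut):
    """P(b|a,t) under the Poisson model, with ut = rate times time."""
    decay = math.exp(-ut)
    if a == b:
        return 1 / 20 + (19 / 20) * decay
    return 1 / 20 - (1 / 20) * decay


# The probabilities of all 20 outcomes for a starting leucine sum to one.
row_sum = sum(transition_probability("L", b, 0.3) for b in AMINO_ACIDS)
print(row_sum)

# At t = 0 nothing has changed; after a very long time every residue
# is equally likely (1/20).
print(transition_probability("L", "L", 0.0))
print(transition_probability("L", "I", 50.0))
```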

27
• Example: MrBayes
• Uses a Markov chain Monte Carlo (MCMC) approach to sample over tree space
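The MCMC idea can be shown on a much smaller problem than tree space. The toy Metropolis sampler below samples a single distance parameter ut for one pair of aligned sequences under the Poisson model (off-diagonal term taken as 1/20 - (1/20)e^(-ut)), with an assumed uniform prior on the interval (0, 10). It is not how MrBayes works internally and it does not sample trees; every name, the prior, the step size, and the counts are assumptions for illustration.

```python
import math
import random


def log_likelihood(ut, n_same, n_diff):
    """Log-likelihood of ut under the Poisson model (constants omitted)."""
    decay = math.exp(-ut)
    return (n_same * math.log(1 / 20 + (19 / 20) * decay)
            + n_diff * math.log(1 / 20 - (1 / 20) * decay))


def metropolis(n_same, n_diff, steps=20000, step_size=0.05, seed=1):
    """Sample ut from its posterior with a random-walk Metropolis chain."""
    rng = random.Random(seed)
    ut = 0.5
    current = log_likelihood(ut, n_same, n_diff)
    samples = []
    for _ in range(steps):
        proposal = ut + rng.gauss(0.0, step_size)
        # Assumed prior: uniform on (0, 10); proposals outside are rejected.
        if 1e-9 < proposal < 10.0:
            candidate = log_likelihood(proposal, n_same, n_diff)
            # Accept with probability min(1, likelihood ratio).
            if rng.random() < math.exp(min(0.0, candidate - current)):
                ut, current = proposal, candidate
        samples.append(ut)
    return samples


# Made-up data: 80 identical and 20 different aligned positions.
samples = metropolis(n_same=80, n_diff=20)
kept = samples[2000:]  # discard the first part of the chain as burn-in
print(sum(kept) / len(kept))
```

A Bayesian phylogenetics program applies the same accept/reject principle, but its proposals change tree topologies, branch lengths, and model parameters.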

28
• To assess the reliability of trees
• Resampling with replacement (see example on next slide)
• What is “good enough”? >60%? >90%?

29
Original sequence alignment:
  Sequence 1: ARNDCQ
  Sequence 2: VRNDCQ
  Columns:    123456
Bootstrap resample 1:
  Sequence 1: RRQCCA
  Sequence 2: RRQCCV
  Columns:    226551
Bootstrap resample 2:
  Sequence 1: AQCDCQ
  Sequence 2: VQCDCQ
  Columns:    165456
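The resampling shown on this slide can be sketched as follows: alignment columns are drawn with replacement until the pseudo-alignment has the original length, and every sequence is rebuilt from the same drawn columns. The function names and the dictionary representation are assumptions for illustration; the first two calls reproduce the column choices given on the slide.

```python
import random


def resample_columns(alignment, columns):
    """Build a pseudo-alignment from a list of 1-based column indices."""
    return {name: "".join(seq[c - 1] for c in columns)
            for name, seq in alignment.items()}


def bootstrap_replicate(alignment, rng):
    """Draw as many columns as the alignment has, with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randint(1, length) for _ in range(length)]
    return columns, resample_columns(alignment, columns)


alignment = {"Sequence 1": "ARNDCQ", "Sequence 2": "VRNDCQ"}

# The two resamples from the slide, using its column indices.
print(resample_columns(alignment, [2, 2, 6, 5, 5, 1]))
print(resample_columns(alignment, [1, 6, 5, 4, 5, 6]))

# A random replicate; the seed is fixed only to make the run repeatable.
columns, replicate = bootstrap_replicate(alignment, random.Random(42))
print(columns, replicate)
```

In a full analysis, a tree is built from each replicate, and the support for a branch is the percentage of replicate trees that contain it.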


31
• Mafft:
  • http://mafft.cbrc.jp/alignment/software/
  • Server: http://mafft.cbrc.jp/alignment/server/
• T-Coffee:
  • http://www.tcoffee.org/Projects_home_page/t_coffee_home_page.html
  • Server: http://www.ch.embnet.org/software/TCoffee.html
  • Server: http://www.ebi.ac.uk/t-coffee/
• ClustalW:
  • ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalW/
  • Server: http://www.ebi.ac.uk/clustalw/
• Probcons:
  • http://probcons.stanford.edu/
  • Server: http://probcons.stanford.edu
• Muscle:
  • http://www.drive5.com/muscle/
  • Server: http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py

32
• List of programs: http://evolution.genetics.washington.edu/phylip/software.html
• ML pairwise distance calculation (protein):
  • TREE-PUZZLE: http://www.tree-puzzle.de/
• Bootstrapping, pairwise distance calculation, UPGMA, NJ, Fitch-Margoliash, ME:
  • PHYLIP: http://evolution.genetics.washington.edu/phylip.html
• ME:
  • FastME (server): http://atgc.lirmm.fr/fastme/
  • MEGA: http://www.megasoftware.net/
• ML:
  • PhyML (server): http://www.atgc-montpellier.fr/phyml/
  • RAxML (server): http://phylobench.vital-it.ch/raxml-bb/
• Bayesian (MCMC):
  • MrBayes: http://mrbayes.csit.fsu.edu/
• Parsimony (esp. on Macintosh), display:
  • PAUP: http://paup.csit.fsu.edu/
• Tree display:
  • Archaeopteryx: http://www.phylosoft.org/archaeopteryx/
• Hypothesis testing:
  • HyPhy: http://www.hyphy.org/

33
• Richard Durbin et al.: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids [http://www.amazon.com/Biological-Sequence-Analysis-Probabilistic-Proteins/dp/0521629713/sr=1-1/qid=1170198997/ref=sr_1_1/102-4955297-1236120?ie=UTF8&s=books]
• Joe Felsenstein: Inferring Phylogenies [http://www.amazon.com/Inferring-Phylogenies-Joseph-Felsenstein/dp/0878931775/sr=8-1/qid=1170198215/ref=pd_bbs_sr_1/102-4955297-1236120?ie=UTF8&s=books]
• Ziheng Yang: Computational Molecular Evolution [http://www.amazon.com/Computational-Molecular-Evolution-Oxford-Ecology/dp/0198567022/sr=1-1/qid=1170198731/ref=pd_bbs_sr_1/102-4955297-1236120?ie=UTF8&s=books]
• Olivier Gascuel: Mathematics of Evolution & Phylogeny [http://www.amazon.com/Mathematics-Evolution-Phylogeny-Olivier-Gascuel/dp/0198566107/sr=1-1/qid=1170198842/ref=sr_1_1/102-4955297-1236120?ie=UTF8&s=books]

34
• Download and install MrBayes: http://mrbayes.csit.fsu.edu/
• Read the tutorial: http://mrbayes.csit.fsu.edu/wiki/index.php/Tutorial
• Analyze the provided data set (“primates.nex”)
• Download and install PHYLIP: http://evolution.genetics.washington.edu/phylip.html
• Perform seqboot (100x) – dnadist – neighbor (NJ) – consense on “primates.nex” (you need to change the format accordingly)
• Compare the results (MrBayes vs. PHYLIP NJ)

