Molecular Phylogeny Analysis, Part II. Mehrshid Riahi, Ph.D. Iranian Biological Research Center (IBRC), July 14-15, 2012
Topics: A few examples of what can be inferred from phylogenetic trees; Tree-Building Methods; Introduction to Distance and Character Based Phylogeny; Maximum Parsimony (MP) and Neighbour-joining (NJ) Analysis using PAUP* 4.08b
What is phylogenetic analysis and why should we perform it? Phylogenetic analysis has two major components: 1. Phylogeny inference or “tree building”; 2. Character and rate analysis
A few examples of what can be inferred from phylogenetic trees built from DNA or protein sequence data: Which species are the closest living relatives of modern humans? Did the infamous Florida Dentist infect his patients with HIV? Mapping evolutionary transitions. Geographic origins. Plus countless others...
Which species are the closest living relatives of modern humans? [Figure: two trees of Humans, Chimpanzees, Bonobos, Gorillas and Orangutans on a time axis in MYA (million years ago): one based on mitochondrial DNA and most nuclear DNA-encoded genes, the other showing the pre-molecular view.]
Phylogenetic Analysis of HIV Virus: Lafayette, Louisiana, 1994. A woman claimed a physician injected her with HIV+ blood. Records show the physician had drawn blood from an HIV+ patient that day. But how to prove the blood from that HIV+ patient ended up in the woman?
HIV Transmission: HIV has a high mutation rate, which can be used to trace paths of transmission. Two people who got the virus from two different people will have very different HIV sequences. Three different tree reconstruction methods (including parsimony) were used to track changes in two genes in HIV (gp120 and RT).
HIV Transmission: Took multiple samples from the patient, the woman, and controls (non-related HIV+ people). In every reconstruction, the woman’s sequences were found to be evolved from the patient’s sequences, indicating a close relationship between the two. Nesting of the victim’s sequences within the patient’s sequences indicated the direction of transmission was from patient to victim. This was the first time phylogenetic analysis was used in a court case as evidence (Metzker et al., 2002).
Did the Florida Dentist infect his patients with HIV? Phylogenetic tree of HIV sequences from the DENTIST, his Patients, & Local HIV-infected People (from Ou et al. (1992) and Page & Holmes (1998)). [Figure: tree whose tips are the dentist, Patients A-G, and Local controls 2, 3, 9 and 35.] Yes: The HIV sequences from these patients fall within the clade of HIV sequences found in the dentist. No: the sequences from the remaining patients fall outside that clade.
How Many Times Has Evolution Invented Wings? Whiting et al. (2003) looked at winged and wingless stick insects.
Reinventing Wings: Previous studies had shown winged → wingless transitions. A wingless → winged transition is much more complicated (need to develop many new biochemical pathways). Used multiple tree reconstruction techniques, all of which required re-evolution of wings.
Most Parsimonious Evolutionary Tree of Winged and Wingless Insects: The evolutionary tree is based on both DNA sequences and presence/absence of wings. The most parsimonious reconstruction gave a wingless ancestor.
Testing evolutionary hypotheses: Mapping evolutionary transitions. Some horned lizards squirt blood from their eyes when attacked by canids. How many times has blood-squirting evolved? This phylogeny suggests a single evolutionary gain and a single loss of blood squirting. [Figure: phylogeny with tips marked by the legend "Blood squirting? No / Yes".]
Testing evolutionary hypotheses: Geographic origins. Where did domestic corn (Zea mays maize) originate? Populations from Highland Mexico are at the base of each maize clade (Matsuoka et al., 2002). [Figure: panels A and B.]
There are three possible unrooted trees for four taxa (A, B, C, D). [Figure: Tree 1, Tree 2 and Tree 3, one for each way of splitting the four taxa into two pairs: AB|CD, AC|BD, AD|BC.] Which one is correct?
Trees can be unrooted or rooted These trees show five different evolutionary relationships among the taxa!
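[Figure: example trees on four, five and six taxa (A-D, A-E, A-F).] (2N - 5)!! = # unrooted trees for N taxa; (2N - 3)!! = # rooted trees for N taxa. For example, N = 4 gives 3!! = 3 x 1 = 3 unrooted trees. Each unrooted tree theoretically can be rooted anywhere along any of its branches.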
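The tree counts can be checked with a few lines of Python. This is a minimal sketch that assumes strictly bifurcating trees with labelled tips; the function names are illustrative and do not come from the slides or from PAUP*.

```python
def double_factorial(k: int) -> int:
    """Return k!! = k * (k - 2) * (k - 4) * ... ; defined as 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted bifurcating trees for n_taxa labelled taxa: (2N - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa for an unrooted tree")
    return double_factorial(2 * n_taxa - 5)


def num_rooted_trees(n_taxa: int) -> int:
    """Number of rooted bifurcating trees for n_taxa labelled taxa: (2N - 3)!!."""
    if n_taxa < 2:
        raise ValueError("need at least 2 taxa for a rooted tree")
    return double_factorial(2 * n_taxa - 3)


if __name__ == "__main__":
    for n in (4, 5, 10, 20):
        print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

An unrooted tree for N taxa has 2N - 3 branches, and a root can be placed on any of them, which is why the rooted count is the unrooted count multiplied by (2N - 3).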
How to root? Using “outgroups” - the outgroup should be a taxon known to be less closely related to the rest of the taxa (ingroups) than they are to each other - it should ideally be as closely related as possible to the rest of the taxa while still satisfying the above condition
Types of phylogenetic analysis methods: Phenetic: trees are constructed based on observed characteristics, not on evolutionary history (distance methods). Cladistic: trees are constructed relying on assumptions about ancestral relationships as well as on current data (parsimony and maximum likelihood methods).
Types of data used in phylogenetic inference:
Character-based methods: Use the aligned characters, such as DNA or protein sequences, directly during tree inference.
  Taxa        Characters
  Species A   ATGGCTATTCTTATAGTACG
  Species B   ATCGCTAGTCTTATATTACA
  Species C   TTCACTAGACCTGTGGTCCA
  Species D   TTGACCAGACCTGTGGTCCG
  Species E   TTGACCAGTTCTCTAGTTCG
Distance-based methods: Transform the sequence data into pairwise distances, and use the matrix during tree building. [Distance matrix for Species A-E; the values are not preserved in this transcript.]
Example 1: Uncorrected “p” distance (= observed percent sequence difference)
Example 2: Kimura 2-parameter distance (estimate of the true number of substitutions between taxa)
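Both example distances can be computed directly from the five sequences on this slide. The sketch below is an illustration, not the PAUP* implementation: it assumes gap-free sequences of equal length containing only A, C, G and T, and the function names are made up for this example. The Kimura 2-parameter formula used is d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the proportions of transitions and transversions.

```python
import math

# Aligned characters from the slide (Species A-E).
ALIGNMENT = {
    "A": "ATGGCTATTCTTATAGTACG",
    "B": "ATCGCTAGTCTTATATTACA",
    "C": "TTCACTAGACCTGTGGTCCA",
    "D": "TTGACCAGACCTGTGGTCCG",
    "E": "TTGACCAGTTCTCTAGTTCG",
}

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_differences(s1: str, s2: str) -> tuple[int, int, int]:
    """Return (transitions, transversions, number of sites compared)."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for x, y in zip(s1, s2):
        if x == y:
            continue
        # A<->G and C<->T are transitions; every other change is a transversion.
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions, len(s1)


def p_distance(s1: str, s2: str) -> float:
    """Uncorrected p distance: proportion of sites that differ."""
    ts, tv, n = count_differences(s1, s2)
    return (ts + tv) / n


def k2p_distance(s1: str, s2: str) -> float:
    """Kimura 2-parameter distance; undefined (math domain error) if too divergent."""
    ts, tv, n = count_differences(s1, s2)
    p, q = ts / n, tv / n
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


if __name__ == "__main__":
    names = sorted(ALIGNMENT)
    for i, a in enumerate(names):
        for b in names[:i]:
            print(a, b, round(p_distance(ALIGNMENT[a], ALIGNMENT[b]), 3),
                  round(k2p_distance(ALIGNMENT[a], ALIGNMENT[b]), 3))
```

The corrected distance is never smaller than the p distance, because it adds an estimate of substitutions hidden by multiple changes at the same site.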
Types of computational methods: Clustering algorithms: Use pairwise distances. Are purely algorithmic methods. Optimality approaches: Use either character or distance data. - minimum branch lengths, - fewest number of events, - highest likelihood
Molecular phylogenetic tree building methods:
                           COMPUTATIONAL METHOD
  DATA TYPE     Optimality criterion                  Clustering algorithm
  Characters    PARSIMONY, MAXIMUM LIKELIHOOD         (none)
  Distances     MINIMUM EVOLUTION, LEAST SQUARES      UPGMA, NEIGHBOR-JOINING
Tree-Building Methods Distance NJ Character Maximum Parsimony
Character Methods Maximum Parsimony minimal changes to produce data
Parsimony methods: Optimality criterion: The ‘most-parsimonious’ tree is the one that requires the fewest number of evolutionary events (e.g., nucleotide substitutions, amino acid replacements) to explain the sequences.
Parsimony methods Parsimony methods are based on the idea that the most probable evolutionary pathway is the one that requires the smallest number of changes from some ancestral state For sequences, this implies treating each position separately and finding the minimal number of substitutions at each position
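One common way to find the minimal number of substitutions at a single position on a given tree is Fitch's algorithm; the slides do not say which algorithm is meant, so treat this as an illustrative sketch. Trees are written as nested 2-tuples of taxon names, and the taxa, states and function names are hypothetical.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum number of changes) for one site.

    tree   -- a taxon name (str) or a 2-tuple of subtrees
    states -- dict mapping taxon name to the character state at this site
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    shared = left_set & right_set
    if shared:
        # The children agree on at least one state: no extra change is needed.
        return shared, left_cost + right_cost
    # The children disagree: one substitution must have happened below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the per-site minimum changes over every column of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


if __name__ == "__main__":
    site = {"A": "G", "B": "G", "C": "T", "D": "T"}   # hypothetical single column
    print(fitch_site((("A", "B"), ("C", "D")), site)[1])  # groups equal states
    print(fitch_site((("A", "C"), ("B", "D")), site)[1])  # splits them up
```

Each position is scored independently, and the tree's total length is the sum over positions, which is what the parsimony criterion compares between trees.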
Example of parsimonious tree building: Tree on left requires only one change, tree on right requires two: left tree is most parsimonious
Parsimony methods assign a cost to each tree available to the dataset, then screen those trees and select the most parsimonious. Screening all the trees available to even a smallish dataset would take too much time; the branch and bound method builds trees with increasing numbers of leaves but abandons a partial topology as soon as its cost exceeds that of the best complete tree found so far.
1. Extract: Outgroup, Species A, Species B, Species C
2. Sequence (molecular characters):
  Outgroup    AAGCTTCATAGGAGCAACCATTCTAATAATAAGCCTCATAAAGCC
  Species A   AAGCTTCACCGGCGCAGTTATCCTCATAATATGCCTCATAATGCC
  Species B   GTGCTTCACCGACGCAGTTGTCCTCATAATGTGCCTCACTATGCC
  Species C   GTGCTTCACCGACGCAGTTGCCCTCATGATGAGCCTCACTATGCA
3. Align
Molecular characters (Outgroup, Species A, Species B, Species C): AAGCTTCATA, GAGCTTCACA, GTGCTTCACG. Invariable sites: These are not useful phylogenetic characters. [Figure: tree with tips Out, A, B, C.]
Molecular characters (Outgroup, Species A, Species B, Species C): AAGCTTCATA, GAGCTTCACA, GTGCTTCACG. Synapomorphies supporting A+B+C: A → G, T → C. Any mutations at this time would affect A, B and C because they have not yet diverged. [Figure: tree with tips Out, A, B, C.]
Molecular characters: Outgroup AAGCTTCATA; Species A GAGCTTCACA; Species B GTGCTTCACG; Species C GTGCCTCACG. Synapomorphies supporting A+B+C: A → G, T → C. Synapomorphies supporting B+C: A → T, A → G. Any mutations at this time would affect B and C. [Figure: tree with tips Out, A, B, C.]
Molecular characters: Outgroup AAGCTTCATA; Species A GAGCTTCACA; Species B GTGCTTCACG; Species C GTGCCTCACG. Synapomorphies supporting A+B+C: A → G, T → C. Synapomorphies supporting B+C: A → T, A → G. Apomorphy for C: T → C. Any mutations at this time would only affect C. [Figure: tree with tips Out, A, B, C.]
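The site-by-site reading of these four sequences can be sketched in Python. The sketch assumes that the outgroup carries the ancestral state at every site and that the sequences map to Outgroup, Species A, Species B and Species C in the order listed; the function name is illustrative.

```python
# Ten-site alignment as listed on the slide (assignment to taxa assumed by order).
ALIGNMENT = {
    "Outgroup": "AAGCTTCATA",
    "A": "GAGCTTCACA",
    "B": "GTGCTTCACG",
    "C": "GTGCCTCACG",
}


def derived_states(alignment, outgroup="Outgroup"):
    """List (site, ancestral state, derived state, taxa sharing the derived state).

    Sites where every ingroup taxon matches the outgroup are invariable and
    produce no entry.
    """
    reference = alignment[outgroup]
    found = []
    for i, ancestral in enumerate(reference):
        groups = {}
        for name, seq in alignment.items():
            if name != outgroup and seq[i] != ancestral:
                groups.setdefault(seq[i], []).append(name)
        for state, taxa in groups.items():
            found.append((i + 1, ancestral, state, tuple(taxa)))
    return found


if __name__ == "__main__":
    for site, ancestral, derived, taxa in derived_states(ALIGNMENT):
        kind = "apomorphy for" if len(taxa) == 1 else "synapomorphy supporting"
        print(f"site {site}: {ancestral} -> {derived}, {kind} {'+'.join(taxa)}")
```

Sites 1 and 9 support A+B+C, sites 2 and 10 support B+C, site 5 is unique to C, and the remaining five sites are invariable.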
Algorithms used for tree searching: I. Exhaustive search: all possibilities → best tree → requires lots of time and computer resources. II. Branch and Bound: a tree is built according to the model given → the tree is compared to the next tree while it is being constructed → if the first tree is better the second tree is abandoned → third tree… → best possible tree. III. Heuristic Search: only the most likely options → saves time and resources, does not always result in the best tree.
Exhaustive Search: With 11 or fewer OTUs, an exhaustive search is feasible - this guarantees the shortest tree(s) will be found (an exact solution) - every possible tree for n taxa is examined - slowest and most rigorous method - provides a frequency histogram of tree scores
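An exhaustive search can be sketched for a very small dataset by generating every topology and scoring each one. This is an illustration only, not how PAUP* is implemented: it uses Fitch scoring with equal weights, nested 2-tuples as trees, and the four ten-site sequences from the "Molecular characters" slides (their assignment to taxa is assumed from the order listed).

```python
ALIGNMENT = {
    "Outgroup": "AAGCTTCATA",
    "A": "GAGCTTCACA",
    "B": "GTGCTTCACG",
    "C": "GTGCCTCACG",
}


def insert_leaf(tree, leaf):
    """Yield every tree made by attaching `leaf` to one branch of `tree`."""
    yield (tree, leaf)                      # the branch leading to this node
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insert_leaf(left, leaf):
            yield (new_left, right)
        for new_right in insert_leaf(right, leaf):
            yield (left, new_right)


def all_trees(taxa):
    """Yield all unrooted bifurcating topologies, drawn rooted on taxa[0]."""
    partial = [taxa[1]]
    for leaf in taxa[2:]:
        partial = [t for p in partial for t in insert_leaf(p, leaf)]
    for subtree in partial:
        yield (taxa[0], subtree)


def fitch(tree, column):
    """Return (state set, minimum changes) for one alignment column."""
    if isinstance(tree, str):
        return {column[tree]}, 0
    (lset, lcost), (rset, rcost) = fitch(tree[0], column), fitch(tree[1], column)
    shared = lset & rset
    return (shared, lcost + rcost) if shared else (lset | rset, lcost + rcost + 1)


def tree_length(tree, alignment):
    """Total number of changes the tree needs across all columns."""
    n_sites = len(next(iter(alignment.values())))
    return sum(fitch(tree, {k: s[i] for k, s in alignment.items()})[1]
               for i in range(n_sites))


def exhaustive_search(alignment):
    """Score every topology; return (best length, best trees, all lengths sorted)."""
    scored = [(tree_length(t, alignment), t) for t in all_trees(list(alignment))]
    best = min(length for length, _ in scored)
    return best, [t for length, t in scored if length == best], sorted(l for l, _ in scored)


if __name__ == "__main__":
    print(exhaustive_search(ALIGNMENT))
```

The sorted list of all tree lengths is the raw material for the frequency histogram of tree scores. Because the number of topologies grows as (2N - 5)!!, this approach stops being practical beyond a handful of taxa.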
Tree searching: With somewhat more OTUs (up to about 25), a branch and bound search is possible - this also guarantees the shortest tree(s) will be found but not all trees are examined (also an exact solution) - families of trees that cannot lead to shorter trees are discarded and not examined - saves time - faster than exhaustive search - no histogram of tree scores
Tree searching: For more than 25 OTUs (most datasets), other methods must be used: heuristic searching - approximate methods - do not guarantee the shortest tree will be found
Heuristic Tree searching: Stepwise addition - builds starting tree (PAUP options). Asis - the order in the data matrix (poor start unless you’ve sorted the OTUs in some phylogenetic order). Closest - starts with the shortest 3-taxon tree, then adds taxa in the order that produces the least increase in tree length (greedy heuristic, like NJ) - will produce a ‘good’ starting tree but produces the same starting tree each time it is used (unless there are ‘ties’, which are randomly broken)
Heuristic Tree Searching: Simple - the first taxon in the matrix is taken as a reference - taxa are added to it in the order of their decreasing similarity to the reference. Random - taxa are added in a random sequence; typically one would perform many replicates, each starting with random addition of taxa - most rigorous
Branch Swapping PAUP allows 3 different types of branch swapping listed in order of increasing rigor: - Nearest neighbor interchange (NNI) - Subtree Pruning and Regrafting (SPR) - Tree Bisection-Reconnection (TBR)
Branch Swapping: Tree Bisection-Reconnection (TBR). Most thorough branch swapping procedure. Tree is broken at an internal branch & all possible reconnections are made between the 2 subtrees
Statistical Methods to Evaluate Trees: 1. Bootstrapping. Bootstrapping is a statistical technique that can use random re-sampling of data to determine sampling error for tree topologies. Agreement among the resulting trees is summarized with a majority-rule consensus tree. n trees are built (n = 100/1000/5000) → how many times is a certain branch reproduced? Each branch of the tree is labelled with the % of bootstrap trees where it occurred. 80% is good, less than 50% is bad.
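The percentage attached to each branch can be sketched as a simple count over the replicate trees. This is a simplified illustration with hypothetical replicate trees: it treats trees as rooted nested tuples and counts clades (sets of tips below a node), whereas real programs count bipartitions of unrooted trees.

```python
from collections import Counter


def clades(tree):
    """Return (all tips under this node, set of clades found in the subtree)."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    tips = frozenset()
    found = set()
    for child in tree:
        child_tips, child_clades = clades(child)
        tips = tips | child_tips
        found |= child_clades
    found.add(tips)
    return tips, found


def clade_support(trees):
    """Map each clade to the % of trees that contain it (whole-tree clade excluded)."""
    counts = Counter()
    for tree in trees:
        all_tips, found = clades(tree)
        counts.update(c for c in found if c != all_tips)
    return {clade: 100.0 * k / len(trees) for clade, k in counts.items()}


if __name__ == "__main__":
    # Four hypothetical bootstrap trees; three of them agree on the clade B+C.
    replicates = [
        ("Out", ("A", ("B", "C"))),
        ("Out", ("A", ("B", "C"))),
        ("Out", ("A", ("B", "C"))),
        ("Out", (("A", "B"), "C")),
    ]
    for clade, pct in sorted(clade_support(replicates).items(), key=lambda kv: -kv[1]):
        print("+".join(sorted(clade)), f"{pct:.0f}%")
```

A majority-rule consensus tree keeps exactly those groups whose support exceeds 50%.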
Bootstrapping Constructs a new multiple alignment at random from the real alignment, with the same size. Note that the same column can be sampled more than once, and consequently some columns are not sampled.
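The resampling step can be sketched as drawing column indices with replacement. This is a minimal illustration that assumes an alignment stored as a dict of equal-length strings; the function name and the example sequences' taxon labels are illustrative.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return a pseudo-replicate built by sampling columns with replacement.

    The replicate has the same taxa and the same number of columns as the
    original; some original columns appear several times and others not at all.
    """
    n_sites = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


if __name__ == "__main__":
    alignment = {
        "Outgroup": "AAGCTTCATA",
        "A": "GAGCTTCACA",
        "B": "GTGCTTCACG",
        "C": "GTGCCTCACG",
    }
    rng = random.Random(2012)          # fixed seed so the run is repeatable
    for _ in range(3):
        print(bootstrap_alignment(alignment, rng))
```

Each pseudo-replicate is then analysed with the same tree-building method as the real alignment, and the resulting trees are compared.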
Statistical Methods to Evaluate Trees 2.Consensus Trees If you get multiple trees, look for regions that are similar. Those are the regions that you can be more confident are correct.
In-class exercise I Use data set and program, choose maximum parsimony. Use heuristic for the tree building method. Inspect your tree.
Distance Methods: Measure distance (dissimilarity). Methods: UPGMA (Unweighted Pair Group Method with Arithmetic Mean), NJ (Neighbor joining), FM (Fitch-Margoliash), ME (Minimum Evolution)
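NJ example: Step 1: Alignment -> distance. Distance (example): observed percent sequence difference. Distance matrix for taxa A-G (lower triangle; only these entries survive in this transcript): A-B = 63, A-C = 94, B-C = 79.
Step 2: distance -> clade. Same distance matrix as in Step 1.
Step 3: merge D and G. Matrix over A, B, C, E, F, DG: A-B = 63, A-C = 94, B-C = 79 (other entries not preserved).
Step 4: Same matrix as in Step 3.
Step 5: Matrix over AF, B, C, E, DG: AF-B = 61, AF-C = 92, B-C = 79 (other entries not preserved).
Step 6: Same matrix as in Step 5.
Step 7: Matrix over AF, BE, C, DG: AF-BE = 63, AF-C = 92, BE-C = 71 (other entries not preserved).
Step 8: Same matrix as in Step 7.
Step 9: Matrix over AF, BE, CDG: AF-BE = 63, AF-CDG = 102, BE-CDG = 88.
Step 10: Matrix over AF, BE, CDG: AF-BE = 63 (other entries not preserved).
Root: Matrix over AFBE, CDG: AFBE-CDG = 94. NJ: distance -> phylogeny.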
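For reference, the standard neighbor-joining procedure (Saitou and Nei's Q criterion) can be sketched as below. Most of the slide's matrix values are not preserved in this transcript, so the example uses a small hypothetical five-taxon matrix rather than the A-G data, and the function and variable names are illustrative.

```python
def neighbor_joining(names, distances):
    """Join taxa pairwise until two nodes remain.

    names     -- list of taxon names
    distances -- dict mapping (x, y) tuples to pairwise distances
    Returns (joins, last_edge): joins is a list of (a, b, branch length to a,
    branch length to b); last_edge is (x, y, length) for the final two nodes.
    """
    d = {frozenset(pair): value for pair, value in distances.items()}
    nodes = list(names)
    joins = []

    def dist(x, y):
        return d[frozenset((x, y))]

    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each node from all the others.
        r = {x: sum(dist(x, y) for y in nodes if y != x) for x in nodes}
        # Choose the pair with the smallest Q = (n - 2) * d(a, b) - r(a) - r(b).
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                a, b = nodes[i], nodes[j]
                q = (n - 2) * dist(a, b) - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        d_ab = dist(a, b)
        len_a = 0.5 * d_ab + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d_ab - len_a
        new = f"({a},{b})"
        # Distance from the new internal node to every remaining node.
        for k in nodes:
            if k not in (a, b):
                d[frozenset((new, k))] = 0.5 * (dist(a, k) + dist(b, k) - d_ab)
        nodes = [k for k in nodes if k not in (a, b)] + [new]
        joins.append((a, b, len_a, len_b))
    x, y = nodes
    return joins, (x, y, dist(x, y))


if __name__ == "__main__":
    # Hypothetical distances, not taken from the slides.
    example = {
        ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
        ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
        ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
    }
    joins, last_edge = neighbor_joining(list("abcde"), example)
    for join in joins:
        print(join)
    print(last_edge)
```

Unlike simple clustering on the smallest raw distance, the Q criterion corrects each pairwise distance for how far both taxa are from everything else, so NJ does not assume equal rates of change along all lineages. The tree it returns is unrooted; an outgroup is still needed to root it.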
In-class exercise II Use same data set and program as in exercise I, but choose distance. Use NJ for the tree building method. Inspect your tree. Compare it to the parsimony generated tree.