1
Inferring phylogenetic trees: Distance methods
Prof. William Stafford Noble
Department of Genome Sciences
Department of Computer Science and Engineering
University of Washington
2
One-minute responses
Thank you for this lecture. It was very interesting.
I think I’m starting to program like a pro.
I wish to hear more on how we can understand better the evolutionary relationships among species, preferably among distinct human populations.
I think I enjoyed today’s lecture. More especially the class problems!
70% of the course has been understood by me.
Tell us more about interpretations.
Python part was easy to follow today.
Python part was very easy to follow. I did not have any problem for the first time.
The lecture was well understood.
The Python part was not so easy for me, but OK.
I appreciate the revision every day, it is very helpful.
Can we learn how to have better output from Python (form / appearance)?
Can we work at this stage on real human genetic data?
3
Outline
Parsimony
Distance methods
    Computing distances
    Finding the tree
Maximum likelihood
4
Revision
What is the input to a phylogenetic inference problem?
    A multiple alignment of DNA or protein sequences.
What is the output?
    A binary tree showing the inferred evolutionary relationships.
For what types of phylogenetic inference problems is maximum parsimony the right approach?
    Small numbers of input sequences.
    Closely related sequences.
What are the two computational problems that must be solved in a maximum parsimony approach?
    Enumerating all possible tree topologies.
    Evaluating the parsimony score for a given topology.
5
Revision
Evaluate the parsimony score of the given tree with respect to the first column of the given alignment.
    Scer  RTGH
    Skud  RTGV
    Sbay  RVGV
    Smik  SVGH
    Spom  STIL
    Svin  RLGH
[Figure: tree over Scer, Skud, Sbay, Smik, Spom and Svin, with the states R and S marked.]
Score = 1
6
Revision
Repeat, but use the second column of the alignment.
    Scer  RTGH
    Skud  RTGV
    Sbay  RVGV
    Smik  SVGH
    Spom  STIL
    Svin  RLGH
[Figure: tree over Scer, Skud, Sbay, Smik, Spom and Svin, with the states T, V and L marked.]
Score = 2
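The scoring exercise above can be sketched in Python with Fitch's algorithm, a standard way to count the minimum number of changes one alignment column needs on a fixed rooted binary tree. The slides do not name an algorithm, and the tree drawn on them is not recoverable from this transcript, so the topology in `TREE` is a hypothetical one chosen only for illustration; its scores need not match the slides.

```python
# Fitch's small-parsimony count for one alignment column.
# ASSUMPTION: TREE is a hypothetical topology, not the tree from the slides.

ALIGNMENT = {
    "Scer": "RTGH", "Skud": "RTGV", "Sbay": "RVGV",
    "Smik": "SVGH", "Spom": "STIL", "Svin": "RLGH",
}

# Nested pairs: a leaf is a species name, an internal node is a 2-tuple.
TREE = ((("Scer", "Skud"), "Sbay"), ("Svin", ("Smik", "Spom")))


def fitch(node, column):
    """Return (candidate state set, number of changes) for a subtree."""
    if isinstance(node, str):
        return {ALIGNMENT[node][column]}, 0
    left_set, left_cost = fitch(node[0], column)
    right_set, right_cost = fitch(node[1], column)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # The children disagree: keep all candidates and count one change.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, column):
    """Minimum number of changes for one column (0-based index) on the tree."""
    return fitch(tree, column)[1]


for col in range(4):
    print("column %d: score %d" % (col + 1, parsimony_score(TREE, col)))
```

On this assumed topology the first column scores 1, because Smik and Spom (both S) sit next to each other. Summing the per-column scores gives the parsimony score of the whole alignment for that tree.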
7
Selecting a method
1. Choose a set of related sequences.
2. Obtain a multiple alignment.
3. Is there strong sequence similarity?
   Yes: maximum parsimony methods.
   No: go to step 4.
4. Is there clearly recognizable sequence similarity?
   Yes: distance methods.
   No: maximum likelihood methods.
8
Distance methods
Multiple sequence alignment -> Pairwise distance matrix -> Phylogenetic tree
9
Calculating distance
    Ancestral: ACTGAACGTAACGC
    Species 1: ACTGTAGGAATCGC
    Species 2: AATGAAAGAATCGC
[Figure: the ancestral sequence joined to species 1 and species 2 by two branches of lengths X and Y.]
The distance between species 1 and 2 is the sum of X and Y.
10
True evolutionary history
[Figure: the ancestral sequence and the sequences of species 1 and species 2, with the substitution events along each lineage labeled.]
Events shown:
    Single substitution
    Multiple substitutions
    Coincidental substitutions
    Parallel substitutions
    Convergent substitution
    Back substitution
11
Jukes-Cantor model
Assume the same probability of change at all positions and all times.
dAB is the proportion of changed sites in the alignment.
KAB is the expected number of changes per position.
Derivation at
12
Jukes-Cantor model
[Figure: aligned sequences of species 1 and species 2.]
3 observed changes in 20 sites
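The relation between dAB and KAB can be sketched in Python. The formula from the slide is not part of this transcript, so the function below uses the textbook Jukes-Cantor correction, K = -(3/4) ln(1 - (4/3) d); the name `jukes_cantor` is chosen here for illustration.

```python
import math


def jukes_cantor(d):
    """Expected changes per site (K) for an observed proportion d of changed sites.

    Uses the textbook correction K = -(3/4) * ln(1 - (4/3) * d), which is
    only defined while d is below 3/4.
    """
    if not 0 <= d < 0.75:
        raise ValueError("the correction is undefined for d >= 0.75")
    return -0.75 * math.log(1 - (4.0 / 3.0) * d)


d_ab = 3 / 20              # 3 observed changes in 20 sites
k_ab = jukes_cantor(d_ab)
print("d = %.3f, K = %.3f" % (d_ab, k_ab))
```

K is always at least as large as d, because it also accounts for changes that are hidden by multiple substitutions at the same site.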
13
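Computing JC distances
    Species 1: ACGTGATCGGTGA
    Species 2: ACTTGATGCCTAG
    Species 3: A-TTACGTAATGG
    Species 4: A-TTGATGGCGTA
Proportion of changed sites: a 4 x 4 matrix over species 1 to 4, still to be filled in.
Pairwise distances: a 4 x 4 matrix over species 1 to 4, still to be filled in.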
14
Computing JC distances
    Species 1: ACGTGATCGGTGA
    Species 2: ACTTGATGCCTAG
    Species 3: A-TTACGTAATGG
    Species 4: A-TTGATGGCGTA
Proportion of changed sites
         1      2      3      4
1        -   6/12   8/12   5/12
2               -   7/12   4/12
3                      -   9/12
4                             -
Pairwise distances: the entry for species 1 and 2 = ?
15
Computing JC distances
    Species 1: ACGTGATCGGTGA
    Species 2: ACTTGATGCCTAG
    Species 3: A-TTACGTAATGG
    Species 4: A-TTGATGGCGTA
Proportion of changed sites
         1      2      3      4
1        -   6/12   8/12   5/12
2               -   7/12   4/12
3                      -   9/12
4                             -
Pairwise distances: the entry for species 1 and 2 is 0.82.
From this matrix, we calculate the tree.
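A minimal Python sketch of this calculation follows. It assumes that the one column containing a gap is dropped from every comparison, which is what gives a denominator of 12 for 13-column sequences, and it uses the textbook Jukes-Cantor correction K = -(3/4) ln(1 - (4/3) d). The function names are illustrative.

```python
import math

ALIGNMENT = {
    1: "ACGTGATCGGTGA",
    2: "ACTTGATGCCTAG",
    3: "A-TTACGTAATGG",
    4: "A-TTGATGGCGTA",
}


def gap_free_columns(seqs):
    """Indices of the alignment columns in which no sequence has a gap."""
    return [i for i in range(len(seqs[0])) if all(s[i] != "-" for s in seqs)]


def count_changes(a, b, columns):
    """Number of the given columns at which sequences a and b differ."""
    return sum(1 for i in columns if a[i] != b[i])


def jukes_cantor(d):
    """Jukes-Cantor distance; infinite once d reaches 3/4 (saturation)."""
    if d >= 0.75:
        return math.inf
    return -0.75 * math.log(1 - 4.0 * d / 3.0)


columns = gap_free_columns(list(ALIGNMENT.values()))
n_sites = len(columns)
names = sorted(ALIGNMENT)
changes = {
    (i, j): count_changes(ALIGNMENT[i], ALIGNMENT[j], columns)
    for i in names for j in names if i < j
}

for (i, j), diffs in sorted(changes.items()):
    k = jukes_cantor(diffs / n_sites)
    print("%d-%d  d = %d/%d  K = %.2f" % (i, j, diffs, n_sites, k))
```

The pair of species 3 and 4 differs at 9 of 12 sites, which is exactly the saturation point of the correction, so the sketch reports an infinite distance for it rather than failing on the logarithm of zero.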
16
Other models
Jukes-Cantor: the simplest possible model.
Kimura: 2 parameters. Differentiates between transitions and transversions.
F84, HKY: 5 parameters. Allows arbitrary base frequencies.
Tamura-Nei: 6 parameters. Combination of F84 and HKY.
General time-reversible model: 12 parameters. Only assumes Pr(x→y) = Pr(y→x).
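As an illustration of what distinguishing transitions from transversions means in practice, here is a Python sketch of the Kimura two-parameter distance. The slides do not give the formula; the one used below is the standard form K = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)), where P is the proportion of sites showing a transition and Q the proportion showing a transversion. The two sequences are the 14-base species 1 and species 2 from the "Calculating distance" slide, and the function names are illustrative.

```python
import math

PURINES = {"A", "G"}   # C and T are the pyrimidines


def count_ts_tv(a, b):
    """Count transitions and transversions between two aligned, gap-free sequences."""
    transitions = transversions = 0
    for x, y in zip(a, b):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1       # purine <-> purine or pyrimidine <-> pyrimidine
        else:
            transversions += 1     # purine <-> pyrimidine
    return transitions, transversions


def kimura_2p(a, b):
    """Kimura two-parameter distance between two aligned sequences."""
    n = len(a)
    ts, tv = count_ts_tv(a, b)
    p, q = ts / n, tv / n
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


species_1 = "ACTGTAGGAATCGC"
species_2 = "AATGAAAGAATCGC"
print("transitions, transversions:", count_ts_tv(species_1, species_2))
print("K2P distance = %.3f" % kimura_2p(species_1, species_2))
```

When transversions are exactly twice as frequent as transitions, as happens in this small example, the Kimura distance coincides with the Jukes-Cantor distance for the same proportion of changed sites.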
17
Distance methods
Multiple sequence alignment -> Pairwise distance matrix -> Phylogenetic tree
Tree-building methods:
    Fitch-Margoliash
    Neighbor-joining
    UPGMA
18
UPGMA Unweighted pair group method with arithmetic mean.
Also known as agglomerative hierarchical clustering. Basic idea: iteratively connect the two most closely related sequences.
19
UPGMA
        Scer  Spar  Smik  Sbay  Skud  Scas  Sklu
Scer       -    31    40    32    29   323   253
Spar             -    26    37    30   300   229
Smik                   -    25    35   290   219
Sbay                         -    30   298   227
Skud                               -   316   243
Scas                                     -    95
Sklu                                           -
20
UPGMA
        Scer  Spar  Smik  Sbay  Skud  Scas  Sklu
Scer       -    31    40    32    29   323   253
Spar             -    26    37    30   300   229
Smik                   -    25    35   290   219
Sbay                         -    30   298   227
Skud                               -   316   243
Scas                                     -    95
Sklu                                           -
Find the smallest off-diagonal element in the matrix (here 25, between Smik and Sbay).
21
UPGMA
        Scer  Spar  Smik  Sbay  Skud  Scas  Sklu
Scer       -    31    40    32    29   323   253
Spar             -    26    37    30   300   229
Smik                   -    25    35   290   219
Sbay                         -    30   298   227
Skud                               -   316   243
Scas                                     -    95
Sklu                                           -
Compute the average between the two rows and columns.
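The two steps just described, finding the smallest off-diagonal element and averaging the two rows, can be sketched in Python as follows. The matrix values are taken from the slides; two entries (Spar to Skud and Sbay to Skud, both 30) do not survive cleanly in the transcript and are inferred from the averaged matrix on the following slide, so treat them as assumptions.

```python
NAMES = ["Scer", "Spar", "Smik", "Sbay", "Skud", "Scas", "Sklu"]

# Upper triangle of the distance matrix, one list per row.
# ASSUMPTION: the two 30s in the Skud column (rows Spar and Sbay) are inferred.
UPPER = [
    [31, 40, 32, 29, 323, 253],   # Scer
    [26, 37, 30, 300, 229],       # Spar
    [25, 35, 290, 219],           # Smik
    [30, 298, 227],               # Sbay
    [316, 243],                   # Skud
    [95],                         # Scas
]

# Expand the triangle into a symmetric lookup keyed by (name, name).
dist = {}
for i, row in enumerate(UPPER):
    for offset, value in enumerate(row):
        a, b = NAMES[i], NAMES[i + 1 + offset]
        dist[a, b] = dist[b, a] = value

# Step 1: the smallest off-diagonal element selects the pair to merge.
pairs = [(NAMES[i], NAMES[j])
         for i in range(len(NAMES)) for j in range(i + 1, len(NAMES))]
a, b = min(pairs, key=lambda pair: dist[pair])
merged = a + "-" + b

# Step 2: average the two rows to get the distances of the merged cluster.
new_row = {c: (dist[a, c] + dist[b, c]) / 2 for c in NAMES if c not in (a, b)}

print("merge %s at distance %d" % (merged, dist[a, b]))
for name, value in new_row.items():
    print("%-10s %-5s %g" % (merged, name, value))
```

A plain average of two rows is enough for this first merger because both clusters contain a single sequence; once clusters differ in size, UPGMA weights each row by the number of sequences in its cluster.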
22
UPGMA Scer Spar Smik Sbay Skud Scas Sklu 31 36 29 323 253 31.5 30 300
31 36 29 323 253 31.5 30 300 229 32.5 294 223 316 243 95
23
UPGMA
                 Scer      Spar Smik-Sbay      Skud      Scas      Sklu
Scer                -        31        36        29       323       253
Spar                          -      31.5        30       300       229
Smik-Sbay                               -      32.5       294       223
Skud                                              -       316       243
Scas                                                        -        95
Sklu                                                                  -
[Figure: subtree joining Smik and Sbay.]
Each merger creates a subtree.
24
Perform the next merger
                 Scer      Spar Smik-Sbay      Skud      Scas      Sklu
Scer                -        31        36        29       323       253
Spar                          -      31.5        30       300       229
Smik-Sbay                               -      32.5       294       223
Skud                                              -       316       243
Scas                                                        -        95
Sklu                                                                  -
[Figure: subtree joining Smik and Sbay.]
25
                 Scer      Spar Smik-Sbay      Skud      Scas      Sklu
Scer                -        31        36        29       323       253
Spar                          -      31.5        30       300       229
Smik-Sbay                               -      32.5       294       223
Skud                                              -       316       243
Scas                                                        -        95
Sklu                                                                  -
[Figure: subtree joining Smik and Sbay.]
26
                 Spar Smik-Sbay Skud-Scer      Scas      Sklu
Spar                -      31.5      30.5       300       229
Smik-Sbay                     -     34.25       294       223
Skud-Scer                               -     319.5       248
Scas                                              -        95
Sklu                                                        -
[Figure: two subtrees, one joining Smik and Sbay and one joining Skud and Scer.]
27
What is next?
                 Spar Smik-Sbay Skud-Scer      Scas      Sklu
Spar                -      31.5      30.5       300       229
Smik-Sbay                     -     34.25       294       223
Skud-Scer                               -     319.5       248
Scas                                              -        95
Sklu                                                        -
[Figure: two subtrees, one joining Smik and Sbay and one joining Skud and Scer.]
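To follow the procedure through to the end, here is a sketch of the full UPGMA loop in Python, run on the original seven-species matrix. It assumes the usual size-weighted averaging, which agrees with the simple row averages on the slides for the first two mergers because those join clusters of equal size. The two Skud entries marked in the code are inferred from the averaged matrices, and the function name `upgma` is illustrative. Cluster labels are built in matrix order, so the slides' Skud-Scer appears here as Scer-Skud.

```python
NAMES = ["Scer", "Spar", "Smik", "Sbay", "Skud", "Scas", "Sklu"]
# ASSUMPTION: the two 30s in the Skud column (rows Spar and Sbay) are inferred.
UPPER = [
    [31, 40, 32, 29, 323, 253],
    [26, 37, 30, 300, 229],
    [25, 35, 290, 219],
    [30, 298, 227],
    [316, 243],
    [95],
]
PAIR_DIST = {
    frozenset((NAMES[i], NAMES[i + 1 + k])): value
    for i, row in enumerate(UPPER) for k, value in enumerate(row)
}


def upgma(names, pair_dist):
    """Return the mergers as (cluster_a, cluster_b, distance), in order."""
    d = dict(pair_dist)
    sizes = {name: 1 for name in names}   # cluster name -> number of leaves
    mergers = []
    while len(sizes) > 1:
        order = list(sizes)
        pair = min(d, key=d.get)          # smallest off-diagonal element
        a, b = sorted(pair, key=order.index)
        mergers.append((a, b, d[pair]))
        merged = a + "-" + b
        total = sizes[a] + sizes[b]
        for c in order:
            if c not in pair:
                # Size-weighted average of the two rows being merged.
                d[frozenset((merged, c))] = (
                    sizes[a] * d[frozenset((a, c))]
                    + sizes[b] * d[frozenset((b, c))]
                ) / total
        # Drop every distance that still refers to the two old clusters.
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        del sizes[a], sizes[b]
        sizes[merged] = total
    return mergers


mergers = upgma(NAMES, PAIR_DIST)
for a, b, distance in mergers:
    print("merge %-26s with %-16s at %g" % (a, b, distance))
```

The third merger reported by this sketch joins Spar to the Skud and Scer subtree at 30.5, the smallest entry in the matrix shown above.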
28
Formatting with %
Insert % between a string and a tuple to get formatted output.
Use %s for strings, %d for integers, and %f or %g for floats.
Use %f for a fixed number of decimal places, %e for exponent, %g for either.
    %g rounds to the specified number of digits of precision.
    %g uses either fixed or exponential notation, depending on the value.
Use leading numbers to specify width. Replace with * to provide width as an input.
Full details at
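A few short examples of these rules, written for Python 3 (`print` as a function). The variable names and values are made up for illustration.

```python
name = "Scer"
length = 1200
freq = 0.3333333

# %s for strings, %d for integers.
sentence = "%s has %d bases" % (name, length)

# %f with a fixed number of decimal places, %e for exponent notation.
fixed = "%.2f" % freq
exponent = "%.3e" % freq

# %g switches between fixed and exponential notation depending on the value.
general_small = "%g" % freq
general_large = "%g" % 1234567.0

# A leading number sets the field width; a minus sign left-justifies.
padded = "[%10s]" % name
left = "[%-10s]" % name

# * takes the width from the tuple instead of the format string.
star = "[%*s]" % (8, name)

for text in (sentence, fixed, exponent, general_small, general_large,
             padded, left, star):
    print(text)
```

Taking the width from the tuple with `*` is what makes it possible to size a column from the data, for example from the longest sequence ID.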
29
Problem #1
Write a program that reads sequences from a given file and prints, in aligned columns, the sequence ID, length and frequency of each letter. You may assume that each sequence is no more than 100,000 characters.
Version 1: Use the alphabet ACGT and a fixed width for the sequence ID.
Version 2: Adjust the field width of the sequence ID based on the longest sequence ID.
Version 3: Use the alphabet of the given sequences. Print fields in alphabetical order.
Version 4: Add a header line to your output file.
    ./compute-seq-stats.py sample-dna.txt
    Read 11 sequences from sample-dna.txt.
    ce1cg A=0.17 C=0.12 G=0.31 T=0.40
    ara   A=0.34 C=0.23 G=0.18 T=0.24
    bglr  A=0.41 C=0.13 G=0.07 T=0.39
    crp   A=0.35 C=0.20 G=0.22 T=0.23
    cya   A=0.24 C=0.19 G=0.21 T=0.36
    deop  A=0.29 C=0.11 G=0.25 T=0.34
    gale  A=0.30 C=0.23 G=0.12 T=0.34
    ilv   A=0.22 C=0.26 G=0.17 T=0.35
    lac   A=0.22 C=0.22 G=0.22 T=0.34
    male  A=0.31 C=0.24 G=0.28 T=0.17
    malk  A=0.26 C=0.15 G=0.37 T=0.22
30
> ./compute-seq-stats-4.py ribosomal.txt
Read 13 sequences from ribosomal.txt.
Longest sequence ID = 32.
20 letters in alphabet.
Alphabet=['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y'].
Sequence Len A C D E F G H I K L M N P Q R S T V W Y
gi| |ref|XP_ |
gi| |emb|CCD
gi| |gb|EMH
gi| |pdb|3ZEY|U
gi| |ref|XP_
gi| |ref|NP_
gi| |ref|NP_
gi| |ref|NP_
gi| |ref|NP_
gi| |ref|NP_
gi| |ref|NP_
gi| |ref|NP_
gi| |ref|NP_
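One possible shape for the formatting part of this problem, as a sketch rather than a model solution. It takes the sequences as a plain dictionary because the format of the course's input files is not shown here, so reading the file is left out. The names `letter_frequencies` and `format_stats` and the toy data are hypothetical.

```python
def letter_frequencies(sequence, alphabet):
    """Frequency of each alphabet letter in a non-empty sequence."""
    n = len(sequence)
    return [sequence.count(letter) / n for letter in alphabet]


def format_stats(sequences):
    """Return report lines: a header, then one aligned row per sequence.

    sequences maps a sequence ID to its letters (hypothetical input shape).
    """
    # Alphabet of the given sequences, in alphabetical order.
    alphabet = sorted(set("".join(sequences.values())))
    # Width of the ID column, taken from the longest ID (or the header word).
    id_width = max(len("Sequence"), max(len(seq_id) for seq_id in sequences))

    header = "%-*s %6s" % (id_width, "Sequence", "Len")
    header += "".join(" %5s" % letter for letter in alphabet)
    lines = [header]
    for seq_id, seq in sequences.items():
        row = "%-*s %6d" % (id_width, seq_id, len(seq))
        row += "".join(" %5.2f" % f for f in letter_frequencies(seq, alphabet))
        lines.append(row)
    return lines


toy = {"seqA": "ACGTACGT", "longer_id": "AAAACCGG"}
for line in format_stats(toy):
    print(line)
```

Because every field has a fixed width and the ID width is passed through `*`, the header and all rows come out the same length regardless of how long the IDs are.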