Multiple Alignment, Distance Estimation, and Phylogenetic Analysis

Workflow: Database search (keyword, similarity) → Multiple alignment (conserved regions) → Distance estimation → Phylogenetic reconstruction → Function prediction
Uses of multiple alignment:
• find oligonucleotide primers for PCR
• predict secondary and tertiary structures of new sequences
• detect similarity between new sequences and existing sequence families
• find diagnostic patterns to characterize protein families
April 2, 2004 BIOS816/VBMS818

Distance estimation
The easiest measure: the number of nucleotide substitutions per site (p)
p = nd / L
nd: the number of different nucleotides between the two sequences
L: the number of nucleotides compared
Example:
ACTGTAGGAATCGC
:X::X:X:::::::
AATGAAAGAATCGC
nd = 3, L = 14, p = 3/14 = 0.214
• It can be applied to both DNA and protein sequences
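The calculation above can be sketched in Python. This is a minimal illustration, not part of the original lecture: the function name p_distance is hypothetical, and it assumes the two sequences are already aligned, have equal length, and contain no gaps.

```python
def p_distance(seq1, seq2):
    """Proportion of differing sites, p = nd / L, for two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    if not seq1:
        raise ValueError("sequences must not be empty")
    # nd: number of sites at which the two sequences differ
    nd = sum(a != b for a, b in zip(seq1, seq2))
    return nd / len(seq1)


# The example pair from the slide: nd = 3, L = 14
p = p_distance("ACTGTAGGAATCGC", "AATGAAAGAATCGC")
print(round(p, 3))  # 0.214
```

Because the function only compares characters, the same sketch works for aligned protein sequences.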

Distance estimation
Ancestral sequence:  AATGTAGGAATCGC
Descendant 1:        ACTGTAGGAATCGC
Descendant 2:        AATGAAAGAATCGC
No hidden substitution: each observed difference comes from a single substitution, so observed differences = actual substitutions = 3.
Multiple hits: if some sites change more than once (the example passes through intermediate states G and C, with 2 substitutions on each of those paths), the later changes hide the earlier ones.
Observed number of differences = 3 < Actual number of substitutions = 6

Effect of multiple substitutions (hits)
[Figure: divergence plotted against time. Actual divergence (the actual number of substitutions) keeps increasing with time, while observed divergence levels off. Three regions are marked along the time axis: Actual = Observed, Actual > Observed, and Actual >> Observed.]
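A small numerical sketch of why observed divergence falls behind actual divergence. It assumes the Jukes-Cantor model (all substitutions equally likely), under which the expected proportion of observed differences for k actual substitutions per site is p = (3/4)(1 - e^(-4k/3)). The function name is hypothetical and the k values are arbitrary.

```python
import math


def expected_p_under_jc(k):
    """Expected observed proportion of differences for k substitutions per site.

    Assumes the Jukes-Cantor model; this is the inverse of the JC correction.
    """
    return 0.75 * (1.0 - math.exp(-4.0 * k / 3.0))


# Small k: observed is close to actual. Large k: observed saturates below 0.75.
for k in (0.05, 0.5, 1.0, 2.0):
    print(f"actual k = {k:.2f}   expected observed p = {expected_p_under_jc(k):.3f}")
```

The printed values stay below the actual k and never reach 0.75, which corresponds to the three regions marked in the figure.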

Distance estimation with multiple-hit corrections
(nucleotide substitutions)
Jukes-Cantor method (one-parameter method)
• All substitutions are equally likely
• k = -(3/4) ln(1 - 4p/3)
  k: the expected number of nucleotide substitutions per site
  p: the proportion of nucleotide differences
Kimura’s 2-parameter method
• Transitions and transversions have different rates
• k = (1/2) ln[1/(1 - 2P - Q)] + (1/4) ln[1/(1 - 2Q)]
  P: the proportion of transitional (Ts) differences
  Q: the proportion of transversional (Tv) differences
[Figure: 4 x 4 substitution rate matrices (A, C, G, T) for the two models]
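The two corrections can be sketched as follows. This is an illustration under stated assumptions, not the lecture's own code: the function names are hypothetical, the sequences are assumed to be gap-free and aligned, and Kimura's distance is written in the equivalent form k = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q).

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def jukes_cantor(p):
    """Jukes-Cantor distance from the proportion of differences p."""
    if p >= 0.75:
        raise ValueError("JC distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def ts_tv_proportions(seq1, seq2):
    """Return (P, Q): proportions of transitional and transversional differences."""
    ts = tv = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        # A transition stays within purines (A, G) or within pyrimidines (C, T)
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            ts += 1
        else:
            tv += 1
    return ts / len(seq1), tv / len(seq1)


def kimura_2p(P, Q):
    """Kimura 2-parameter distance from Ts (P) and Tv (Q) proportions."""
    a = 1.0 - 2.0 * P - Q
    b = 1.0 - 2.0 * Q
    if a <= 0.0 or b <= 0.0:
        raise ValueError("K2P distance is undefined for these P and Q")
    return -0.5 * math.log(a) - 0.25 * math.log(b)


s1, s2 = "ACTGTAGGAATCGC", "AATGAAAGAATCGC"
P, Q = ts_tv_proportions(s1, s2)
print("JC :", round(jukes_cantor(P + Q), 4))
print("K2P:", round(kimura_2p(P, Q), 4))
```

For the example pair used earlier in the lecture (one transition and two transversions in 14 sites), both corrected distances come out larger than the uncorrected p = 0.214.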

(nucleotide substitutions)
k = -(3/4) ln(1 - 4p/3)
If p ≥ 0.75, the JC distance cannot be estimated.
[Figure: k (Jukes-Cantor distance) plotted against p (uncorrected nucleotide difference), with the line k = p for reference and the limit p = 0.75 marked.]

(nucleotide substitutions)
There are many distance estimation methods based on different models.
1. More parameters:
• 1-parameter (Jukes-Cantor method)
• 2-parameter (Kimura’s 2-p method)
• 3, 4, 6, ... up to 12 parameters!!
2. Variation in base composition (A ≠ C ≠ G ≠ T):
• 1-p & base comp. (Tajima & Nei or F81 method)
• 2-p & base comp. (HKY85 or F84 method), etc.
3. Rate variation among sites: approximated by a gamma distribution
• CV (coefficient of variation of the rate): smaller → less variation
• 1-p & rate variation (Jin & Nei method), etc.
4. LogDet method:
• No constraints on parameters; base composition can vary among sequences
• No among-site rate variation can be considered
[Figure: 4 x 4 substitution rate matrix (A, C, G, T)]

(nucleotide substitutions)
Which distance method should we choose?
Things to consider:
• Substitution pattern (e.g., Ts/Tv)
• Base composition bias
• Rate heterogeneity among sites
1. More parameters → more flexible, more realistic
2. More parameters → larger sampling errors (lower precision)
3. More parameters → more “undefined” distance problems (e.g., if p ≥ 0.75 in the JC method, k = -(3/4) ln(1 - 4p/3) becomes “undefined” or “infinite”)

(amino acid substitutions)
1. Poisson distance: k = -ln(1 - p)
   k: the expected number of amino acid substitutions per site
   p: the proportion of amino acid differences
2. Kimura’s distance: k = -ln(1 - p - 0.2p²)
• Approximation of the PAM distance below (accurate when p < 0.75)
• Distance becomes infinite when 1 - p - 0.2p² ≤ 0 (p ≥ approximately 0.854)
3. PAM distance & JTT distance
• Distance based on the PAM or JTT amino acid substitution matrix
• The JTT matrix is newer and based on a much larger protein sample
NOTE for ClustalW/ClustalX
“Kimura’s distance” in ClustalW/ClustalX is a hybrid between Kimura’s and PAM distances:
• p ≤ 0.75: use Kimura’s correction
• 0.75 < p ≤ 0.93: use a conversion table with 0.001 interval (.750, .751, ...)
• p > 0.93: k = 10.0
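The first two amino acid distances can be sketched directly from their formulas. The function names are hypothetical, and the ClustalW/ClustalX conversion table is not reproduced here, so this sketch covers only the closed-form corrections.

```python
import math


def poisson_distance(p):
    """Poisson-corrected distance, k = -ln(1 - p)."""
    if p >= 1.0:
        raise ValueError("Poisson distance is undefined for p >= 1")
    return -math.log(1.0 - p)


def kimura_protein_distance(p):
    """Kimura's protein distance, k = -ln(1 - p - 0.2 p^2)."""
    x = 1.0 - p - 0.2 * p * p
    if x <= 0.0:
        raise ValueError("Kimura's distance is infinite for this p")
    return -math.log(x)


# Compare the uncorrected proportion with both corrections
for p in (0.1, 0.3, 0.5, 0.7):
    print(f"p = {p:.1f}  Poisson = {poisson_distance(p):.3f}  "
          f"Kimura = {kimura_protein_distance(p):.3f}")
```

Kimura's correction is always at least as large as the Poisson correction, and the gap widens as p grows.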

Phylogenetic reconstruction methods
Neighbor Joining (NJ)
• Data type: distance
• Optimality criterion: minimum evolution (shortest total branch length)
• Speed: fastest
* NJ does not search for the ME tree. NJ provides a simplified (approximate) algorithm to find the ME tree.
Maximum Parsimony (MP)
• Data type: sequence (or other) data
• Optimality criterion: maximum parsimony (smallest number of evolutionary changes)
Maximum Likelihood (ML)
• Data type: sequence (or other) data
• Optimality criterion: maximum likelihood (highest probability of observing the data under a given tree and a given model of substitutions)
• Speed: slowest

Phylogenetic reconstruction: distance matrix methods
UPGMA (unweighted pair-group method with arithmetic mean)
Example: a distance matrix for 5 sequences.
      B     C     D     E
A    .53   .99  1.02   .82
B          .80   .93   .73
C                .65   .81
D                      .94
→ The pair with the smallest distance is grouped (A/B first), and distances to the new group are averaged:
      C     D     E
A/B  .90   .98   .78
→ Grouping is repeated (C/D, then A/B/E) until all of the sequences are clustered in a tree
• assumes all sequences evolve at the same rate
• generates a rooted tree
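The grouping procedure can be sketched in Python using the distance matrix from this slide. This is a minimal illustration: the function name, the dictionary-based matrix, and the "X/Y" cluster labels are choices made here, not part of the lecture. It reports only the order of the merges and the distance at each merge.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster with UPGMA; dist maps frozenset({a, b}) -> distance.

    Returns a list of merges (cluster_a, cluster_b, distance_at_merge).
    """
    sizes = {name: 1 for name in names}  # cluster label -> number of sequences
    d = dict(dist)
    merges = []
    while len(sizes) > 1:
        # Group the pair of clusters with the smallest distance
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        dab = d[frozenset((a, b))]
        na, nb = sizes.pop(a), sizes.pop(b)
        new = f"{a}/{b}"
        # Distance to the new cluster: size-weighted arithmetic mean
        for c in sizes:
            d[frozenset((new, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        sizes[new] = na + nb
        merges.append((a, b, dab))
    return merges


names = ["A", "B", "C", "D", "E"]
rows = {"A": [0.53, 0.99, 1.02, 0.82], "B": [0.80, 0.93, 0.73],
        "C": [0.65, 0.81], "D": [0.94]}
dist = {}
for idx, a in enumerate(names[:-1]):
    for b, value in zip(names[idx + 1:], rows[a]):
        dist[frozenset((a, b))] = value

merges = upgma(names, dist)
for a, b, dab in merges:
    print(f"group {a} + {b} at distance {dab:.3f}")
```

Under the constant-rate assumption, the height of each node in the rooted tree is half the distance at which its two clusters were grouped.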

Phylogenetic reconstruction: distance matrix methods
NJ (neighbor joining method)
Example: a distance matrix for 5 sequences.
      B     C     D     E
A    .53   .99  1.02   .82
B          .80   .93   .73
C                .65   .81
D                      .94
1) Start with a star-like phylogeny.
2) The total length of the tree (the sum of the branch lengths) is estimated.
3) Find neighbors sequentially that minimize the total length of the tree.
• does not assume a constant rate
• generates an unrooted tree
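A compact sketch of neighbor joining on the same 5-sequence matrix. It uses the Q-matrix form of the criterion, Q(i, j) = (n - 2) d(i, j) - r(i) - r(j), which is a common way to implement the "minimize the total tree length" step; that choice, the function name, and the node labels are assumptions made for this illustration.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Join neighbors until two nodes remain.

    dist maps frozenset({a, b}) -> distance. Returns (joins, last_branch);
    each join is (node_i, node_j, branch_length_i, branch_length_j).
    """
    nodes = list(names)
    d = dict(dist)
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        # r[i]: sum of distances from node i to every other current node
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
        # Choose the pair that minimizes Q(i, j)
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        dij = d[frozenset((i, j))]
        # Branch lengths from i and j to their new internal node
        li = dij / 2 + (r[i] - r[j]) / (2 * (n - 2))
        lj = dij - li
        new = f"({i},{j})"
        nodes.remove(i)
        nodes.remove(j)
        # Distances from the new internal node to the remaining nodes
        for k in nodes:
            d[frozenset((new, k))] = (
                d[frozenset((i, k))] + d[frozenset((j, k))] - dij
            ) / 2
        nodes.append(new)
        joins.append((i, j, li, lj))
    a, b = nodes
    return joins, (a, b, d[frozenset((a, b))])


names = ["A", "B", "C", "D", "E"]
rows = {"A": [0.53, 0.99, 1.02, 0.82], "B": [0.80, 0.93, 0.73],
        "C": [0.65, 0.81], "D": [0.94]}
dist = {}
for idx, a in enumerate(names[:-1]):
    for b, value in zip(names[idx + 1:], rows[a]):
        dist[frozenset((a, b))] = value

joins, last_branch = neighbor_joining(names, dist)
for i, j, li, lj in joins:
    print(f"join {i} (branch {li:.3f}) and {j} (branch {lj:.3f})")
print("final branch:", last_branch[0], last_branch[1], round(last_branch[2], 3))
```

The two branch lengths of a joined pair are allowed to differ, which reflects the fact that NJ does not assume a constant rate.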

Rooted trees vs. unrooted trees
[Figure: an unrooted tree of four sequences (A, B, C, D) and rooted trees drawn along a time axis. Placing the root on different branches of the same unrooted tree gives different rooted trees; an outgroup is used to place the root.]

Phylogenetic reconstruction: bootstrap analysis
Bootstrap analysis is used to estimate the confidence level of phylogenetic hypotheses.
1) Start from a multiple alignment (in the example, sequences S1-S5 and sites 1-8). Each column is an independent sample.
2) Resample the columns with replacement many times (~1000 replications) to create pseudoreplicates, for example:
     Replicate 1   Replicate 2   Replicate 3   ...
S1   GGAGGTTA      CGCAGCAC      TTTTGGCG
S2   GGAGGTTA      CGCAGTAT      TTTTGGTG
S3   AAAAACTA      CACAACAC      CTCTAACA
S4   AAAAATCA      CACAACAC      TCTCAACA
S5   GGAGGTCG      TGTAGCGC      TCTCGGCG
3) A phylogeny is reconstructed from each pseudoreplicate.
4) Bootstrap support (%) for each group is the percentage of pseudoreplicate trees that contain it (values of 100, 100, and 40 in the example tree).
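The resampling and the support calculation can be sketched as follows. The toy alignment, the replicate groups, and the function names are all hypothetical (the original alignment from the slides is not reproduced), and tree reconstruction for each pseudoreplicate is left out.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one pseudoreplicate by resampling columns with replacement."""
    n_sites = len(next(iter(alignment.values())))
    picked = [rng.randrange(n_sites) for _ in range(n_sites)]
    # The same column indices are used for every sequence
    return {name: "".join(seq[i] for i in picked)
            for name, seq in alignment.items()}


def bootstrap_support(group, replicate_groups):
    """Percentage of replicate trees whose set of groups contains `group`."""
    hits = sum(group in groups for groups in replicate_groups)
    return 100.0 * hits / len(replicate_groups)


# Hypothetical toy alignment with 8 sites (not the one shown in the slides)
toy = {"S1": "GATTACAG", "S2": "GATTACAA", "S3": "CATTGCAA",
       "S4": "CATTGCTA", "S5": "GACTACAG"}
rng = random.Random(2004)  # fixed seed so the run is repeatable
replicate = bootstrap_alignment(toy, rng)
for name, seq in replicate.items():
    print(name, seq)

# Hypothetical groups found in three replicate trees
trees = [
    {frozenset({"S1", "S2"}), frozenset({"S3", "S4"})},
    {frozenset({"S1", "S2"})},
    {frozenset({"S1", "S2"}), frozenset({"S3", "S4"})},
]
print(bootstrap_support(frozenset({"S1", "S2"}), trees))  # 100.0
```

In a real analysis each pseudoreplicate would be passed to a tree-building step (for example a distance calculation followed by NJ), and the groups of the resulting trees would be counted as in bootstrap_support.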

Phylogenetic reconstruction by ClustalX/W & Phylip
Phylogeny programs:
• ClustalW/ClustalX (DNA/protein distance, NJ)
• Phylip3.5 & Phylip3.6a3 (standalone, web-interface)
• PAUP (also included in GCG)
Visualization: Phylip (treegram, etc.), TreeView, PAUP, NJplot
More phylogeny programs

Phylip programs
Bootstrap: seqboot (sequence data)
DNA distance: dnadist (nucleotide sequence data)
Protein distance: protdist (amino acid sequence data)
Neighbor joining: neighbor (distance matrix)
Consensus tree: consense (tree file)
Tree drawing: drawgram, drawtree, retree (treefile)
              Phylip3.5                        Phylip3.6a3
Input file:   infile                           infile, intree
Output file:  outfile, treefile                outfile, outtree
dnadist:      Kimura, Jin/Nei, ML (F84), JC    F84, Kimura, JC, LogDet (rate variation can be incorporated with all but LogDet)
protdist:     Kimura, PAM (Dayhoff)            Kimura, PAM (Dayhoff), JTT

Phylogenetic reconstruction by ClustalW & Phylip
Web servers:
• Bioinformatics Core Facility Web server (Phylip3.5)
• Bioinformatics Web: IU Center for Genomics & Bioinformatics (Phylip3.5)
• Institut Pasteur, Biological Software list (Phylip3.6a3)
Download sites:
• Phylip download site (Windows, Macintosh, Linux/Unix): Phylip3.5, Phylip3.6b
• TreeView download site (Windows, Macintosh, Linux/Unix)
• NJPlot download site (Windows, Macintosh, Linux/Unix)

Multiple Alignment by CLUSTALW
Web servers:
• Bioinformatics Core Facility Web server
• Bioinformatics Web: IU Center for Genomics & Bioinformatics
• Institut Pasteur, Biological Software list
• EMBL-EBI ClustalW Form
Download site:
• ClustalX FTP site (Windows, Macintosh, Linux/Unix): ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX/

ClustalW/Phylip Exercise
1. Download the two sample data files from the course web site:
   bglobin.seq - protein sequences
   Dloop.seq - DNA sequences
   [Use whichever format (DOS or non-DOS) works for you.]
→ These sequences are in FASTA format.
>HBB_HUMAN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>HBB_HORSE
VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSNPGAVMGNPKV
KAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDPENFRLLGNVLVVVLARHFGK
DFTPELQASYQKVVAGVANALAHKYH

ClustalW/Phylip Exercise (continued)
2. Go to this ClustalW web site:
3. Enter your address.
4. Copy and paste bglobin.seq data.
5. Check “Phylip alignment ouput format”.
6. Check the available options.
7. Click “Run clustalw” button to start the alignment.
   Wait until the page changes to the results page...
8. Click “infile.phy” link to open the multiple alignment in Phylip format.
9. From the pull-down menu, choose “protdist” program.

10. Click “Run the selected program on infile.phy” button to start protdist.
11. Choose “Jones-Taylor-Thornton (JTT) matrix” as the distance model.
12. Check “Perform a bootstrap ...” option.
    Enter a “Random number seed”.
    Enter 10 as the number of replicates.
    → For the real analysis, use 500 or more replicates (~1000).
13. Click “Run protdist” button to run the program.
    Wait until the page changes to the results page...
14. Click “outfile” link to check the distance matrix file you generated.
15. From the pull-down menu, choose “neighbor” program.

16. Click “Run the selected program on outfile” button to start neighbor.
17. Choose “Neighbor-joining” as the distance method.
18. Check “Randomize (jumble) input order”.
    Enter a “Random number seed”.
    → Using the “randomization” option slows down the program, but it is usually a better idea to use it to avoid artifacts from the input order.
19. Check “Analyze multiple data set”.
    → If you are not doing bootstrap analysis, you don’t have to check this option.
    Enter the number of data sets (10 for this example).
    Check “Compute a consense tree”.
20. Click “Run neighbor” to run the program.
    Wait until the page changes to the results page...

21. Click “outfile.consense” to open the output file.
22. Click “outtree.consense” to open the output file.
    → Note that the numbers after the taxon names are not branch lengths but bootstrap values.
23. Click “outtree” to open the output file.
    This file contains 10 trees based on the bootstrapped alignments.
    Save the first tree in a file to use it for the TreeView demonstration.
    → We are using this tree only as an example. For the real analysis, you should build an NJ tree from the original multiple alignment, without bootstrap resampling.
    → In step 12, uncheck the “Perform a bootstrap ...” option to generate the NJ tree without bootstrap analysis.

TreeView Exercise
24. Find “TreeView” software on your machine and start the program.
25. From the File menu, open the tree file you saved.
26. Click the different tree icons to change the phylogeny format.
    → Which format shows different branch lengths?
27. From the Tree menu, select “Define outgroup”.
    Choose one sequence as an outgroup.
28. From the Tree menu, select “Root with outgroup”.
29. From the Edit menu, select “Edit tree”.
    Check how you can edit your tree.
The assignment from my lectures (Assignment #4) is found on the Blackboard Assignment page.
