CSCE555 Bioinformatics Lecture 13 Phylogenetics II Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page:


1 CSCE555 Bioinformatics Lecture 13 Phylogenetics II Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page: http://www.scigen.org/csce555 University of South Carolina Department of Computer Science and Engineering 2008 www.cse.sc.edu HAPPY CHINESE NEW YEAR

2 Outline
◦ Review for exams
◦ Data for phylogenetic tree inference
◦ Classification of tree inference approaches
◦ Neighbor-joining algorithm
◦ Parsimony-based tree reconstruction
◦ Least squares best-fit reconstruction

3 Midterm How to review: read slides and textbooks, especially CG book. Format of problems: examples ◦ Brief questions: what is the difference between global alignment and local alignment? ◦ Calculation: build an HMM model for a multiple sequence alignment ◦ Definition: blasting, motif, ORF

4 Covered Topics Understand: concepts, algorithm ideas, tools ◦ Sequencing/blasting ◦ Gene finding ◦ Alignment algorithms and applications ◦ DNA motif search ◦ HMM profiles ◦ Gene prediction algorithms ◦ Promoter predictions ◦ Comparative genomics ◦ ……

5 Phylogenetic Reconstruction
[Figure: a character array (taxa A to E scored for binary characters) next to a taxa-by-taxa distance matrix.]
There are essentially two types of data for phylogenetic tree estimation:
◦ Distance data, usually stored in a distance matrix, e.g. DNA×DNA hybridisation data, morphometric differences, immunological data, pairwise genetic distances
◦ Character data, usually stored in a character array, e.g. multiple sequence alignment of DNA sequences, morphological characters.

6 Phylogenetic Reconstruction Given the huge number of possible trees even for small data sets, we have two options: ◦ Build one according to some clustering algorithm ◦ Assign a “goodness of fit” criterion (an objective function) and find the tree(s) which optimise(s) this criterion
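To give a feel for "the huge number of possible trees", here is a small sketch that counts unrooted binary tree topologies on n labeled leaves using the standard double-factorial formula (2n-5)!!. The function name `count_unrooted_trees` is made up for this illustration and is not part of the lecture.

```python
def count_unrooted_trees(n_leaves: int) -> int:
    """Count unrooted binary tree topologies on n labeled leaves: (2n-5)!!."""
    if n_leaves < 3:
        return 1  # one or two leaves admit a single (trivial) topology
    count = 1
    # Multiply the odd factors 3, 5, ..., 2n-5.
    for factor in range(3, 2 * n_leaves - 4, 2):
        count *= factor
    return count


for n in (4, 5, 10, 20):
    print(n, count_unrooted_trees(n))
```

Even at 10 taxa there are already about two million topologies, which is why exhaustive search is only feasible for very small data sets.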

7 Phylogenetic Reconstruction: tree building methods by type of data
◦ Clustering algorithm, distance data: UPGMA, Neighbor-Joining
◦ Optimality criterion, distance data: Minimum Evolution
◦ Optimality criterion, nucleotide sites: Maximum Parsimony, Maximum Likelihood

8 Phylogenetic Methods Many different procedures exist. Three of the most popular:
◦ Neighbor-joining: minimizes distance between nearest neighbors
◦ Maximum parsimony: minimizes total evolutionary change
◦ Maximum likelihood: maximizes likelihood of observed data

9 Distance-based Tree Construction Given a set of species (leaves in a supposed tree) and distances between them, construct a phylogeny which best “fits” the distances.
Orc: ACAGTGACGCCCCAAACGT
Elf: ACAGTGACGCTACAAACGT
Dwarf: CCTGTGACGTAACAAACGA
Hobbit: CCTGTGACGTAGCAAACGA
Human: CCTGTGACGTAGCAAACGA
[Figure: tree with leaves Orc, Elf, Dwarf, Hobbit, Human.]

10 Distance Matrix Given n species, we can compute the n × n distance matrix $D_{ij}$. $D_{ij}$ may be defined as the edit distance between a gene in species i and species j, where the gene of interest is sequenced for all n species. $D_{ij}$ can also be any other feature-based distance.
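As a rough sketch of how such a matrix could be filled in, the snippet below uses the five equal-length sequences from the distance-based construction slide. It assumes the sequences are already aligned and counts mismatched positions (Hamming distance) as a simple stand-in for edit distance; the helper names are illustrative only.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(seqs: dict) -> dict:
    """Build a nested dict D[name_i][name_j] of pairwise Hamming distances."""
    return {i: {j: hamming(seqs[i], seqs[j]) for j in seqs} for i in seqs}


sequences = {
    "Orc": "ACAGTGACGCCCCAAACGT",
    "Elf": "ACAGTGACGCTACAAACGT",
    "Dwarf": "CCTGTGACGTAACAAACGA",
    "Hobbit": "CCTGTGACGTAGCAAACGA",
    "Human": "CCTGTGACGTAGCAAACGA",
}
D = distance_matrix(sequences)
for name, row in D.items():
    print(name, [row[other] for other in sequences])
```

With real data, sequences of different lengths would need a proper alignment or an edit-distance routine rather than this position-by-position comparison.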

11 Distances in Trees Edges may have weights reflecting: ◦ Number of mutations on evolutionary path from one species to another ◦ Time estimate for evolution of one species into another In a tree T, we often compute $d_{ij}(T)$, the length of the path between leaves i and j

13 Distance in Trees: an Example $d_{1,4} = 12 + 13 + 14 + 17 + 12 = 68$ [Figure: weighted tree with the path between leaves i and j highlighted.]
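The path-length computation can be sketched as a walk over a weighted adjacency map. The tree below is hypothetical: only the five edge weights on the path from leaf 1 to leaf 4 (12, 13, 14, 17, 12) come from the slide, while the internal node names and the two side branches are invented for illustration.

```python
def build_tree(edges):
    """Turn (node_a, node_b, weight) triples into a symmetric adjacency map."""
    tree = {}
    for a, b, weight in edges:
        tree.setdefault(a, {})[b] = weight
        tree.setdefault(b, {})[a] = weight
    return tree


def path_length(tree, start, goal):
    """Sum the edge weights on the unique path between two nodes of a tree."""
    stack = [(start, None, 0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for neighbor, weight in tree[node].items():
            if neighbor != parent:  # never walk back along the edge just used
                stack.append((neighbor, node, dist + weight))
    raise ValueError("nodes are not connected")


tree = build_tree([
    ("1", "a", 12), ("a", "b", 13), ("b", "c", 14), ("c", "d", 17), ("d", "4", 12),
    ("a", "2", 11), ("b", "3", 6),  # invented side branches
])
print(path_length(tree, "1", "4"))
```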

14 Fitting Distance Matrix Given n species, we can compute the n × n distance matrix $D_{ij}$. Evolution of these genes is described by a tree that we don’t know. We need an algorithm to construct a tree that best fits the distance matrix $D_{ij}$.

15 Reconstructing a 3 Leaved Tree Tree reconstruction for any 3 × 3 matrix is straightforward. We have 3 leaves i, j, k and a center vertex c. Observe:
$d_{ic} + d_{jc} = D_{ij}$
$d_{ic} + d_{kc} = D_{ik}$
$d_{jc} + d_{kc} = D_{jk}$

16 Reconstructing a 3 Leaved Tree Adding the first two equations:
$(d_{ic} + d_{jc}) + (d_{ic} + d_{kc}) = D_{ij} + D_{ik}$
$2d_{ic} + d_{jc} + d_{kc} = D_{ij} + D_{ik}$
$2d_{ic} + D_{jk} = D_{ij} + D_{ik}$
$d_{ic} = (D_{ij} + D_{ik} - D_{jk})/2$
Similarly,
$d_{jc} = (D_{ij} + D_{jk} - D_{ik})/2$
$d_{kc} = (D_{ki} + D_{kj} - D_{ij})/2$
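The three closed-form solutions translate directly into code. This is a minimal sketch; the function name and the sample distances (5, 7, 8) are chosen here only to illustrate the formulas.

```python
def three_leaf_edges(D_ij: float, D_ik: float, D_jk: float):
    """Return (d_ic, d_jc, d_kc), the edge lengths from each leaf to center c."""
    d_ic = (D_ij + D_ik - D_jk) / 2
    d_jc = (D_ij + D_jk - D_ik) / 2
    d_kc = (D_ik + D_jk - D_ij) / 2
    return d_ic, d_jc, d_kc


d_ic, d_jc, d_kc = three_leaf_edges(D_ij=5, D_ik=7, D_jk=8)
print(d_ic, d_jc, d_kc)
# The recovered edges reproduce the input distances.
print(d_ic + d_jc, d_ic + d_kc, d_jc + d_kc)
```

An edge length comes out negative exactly when the three input distances violate the triangle inequality, in which case no tree with non-negative edges fits them.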

17 Trees with > 3 Leaves An unrooted binary tree with n leaves has 2n-3 edges. This means fitting a given tree to a distance matrix D requires solving a system of “n choose 2” equations with 2n-3 variables. For n > 3 there are more equations than variables, so the system is not always solvable.

18 Additive Distance Matrices Matrix D is ADDITIVE if there exists a tree T with $d_{ij}(T) = D_{ij}$ for all pairs i, j, and NON-ADDITIVE otherwise.
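One common way to test additivity, not stated on the slide, is the four-point condition: for every four leaves, the two largest of the three pairwise sums must be equal. The sketch below assumes D is symmetric with a zero diagonal and satisfies the triangle inequality; the function name and both example matrices are made up for illustration.

```python
from itertools import combinations


def is_additive(D, tol=1e-9):
    """Four-point condition check on a square distance matrix (list of lists)."""
    for i, j, k, l in combinations(range(len(D)), 4):
        sums = sorted([
            D[i][j] + D[k][l],
            D[i][k] + D[j][l],
            D[i][l] + D[j][k],
        ])
        # The two largest sums must coincide for a tree to exist.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Distances read off a tree with leaf edges 1, 2, 4, 5 and internal edge 3.
additive = [
    [0, 3, 8, 9],
    [3, 0, 9, 10],
    [8, 9, 0, 9],
    [9, 10, 9, 0],
]
# Same matrix with one pair of entries perturbed from 9 to 11.
perturbed = [row[:] for row in additive]
perturbed[0][3] = perturbed[3][0] = 11

print(is_additive(additive), is_additive(perturbed))
```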

19 Distance Based Phylogeny Problem Goal: Reconstruct an evolutionary tree from a distance matrix. Input: n × n distance matrix $D_{ij}$. Output: weighted tree T with n leaves fitting D. If D is additive, this problem has a solution and there is a simple algorithm to solve it.

20 Using Neighboring Leaves to Construct the Tree
◦ Find neighboring leaves i and j with parent k
◦ Remove the rows and columns of i and j
◦ Add a new row and column corresponding to k, where the distance from k to any other leaf m can be computed as: $D_{km} = (D_{im} + D_{jm} - D_{ij})/2$
◦ Compress i and j into k, iterate algorithm for rest of tree
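The compression step can be sketched as follows. The nested-dict representation, the function name `merge_neighbors`, and the four-leaf example (leaf edges 1, 2, 4, 5 and an internal edge of 3) are assumptions made for this illustration.

```python
def merge_neighbors(D, i, j, k):
    """Replace neighboring leaves i and j by their parent k in matrix D."""
    others = [m for m in D if m not in (i, j)]
    # Copy the part of the matrix that does not involve i or j.
    new_D = {m: {p: D[m][p] for p in others} for m in others}
    new_D[k] = {k: 0}
    for m in others:
        d_km = (D[i][m] + D[j][m] - D[i][j]) / 2
        new_D[k][m] = d_km
        new_D[m][k] = d_km
    return new_D


D = {
    "A": {"A": 0, "B": 3, "C": 8, "D": 9},
    "B": {"A": 3, "B": 0, "C": 9, "D": 10},
    "C": {"A": 8, "B": 9, "C": 0, "D": 9},
    "D": {"A": 9, "B": 10, "C": 9, "D": 0},
}
reduced = merge_neighbors(D, "A", "B", "k")
print(reduced["k"])
```

In the example, the new distances from k to C and D equal the internal edge plus the respective leaf edge (3 + 4 and 3 + 5), as expected for an additive matrix.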

21 Finding Neighboring Leaves To find neighboring leaves we simply select a pair of closest leaves.

22 Finding Neighboring Leaves To find neighboring leaves we simply select a pair of closest leaves. WRONG

23 Finding Neighboring Leaves Closest leaves aren’t necessarily neighbors: i and j are neighbors, but $(d_{ij} = 13) > (d_{jk} = 12)$. Finding a pair of neighboring leaves is a nontrivial problem!

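24 Neighbor Finding: Saitou & Nei algorithm (1987) Theorem (Saitou & Nei): Assume the distances d come from a tree in which all edge weights are positive. If D(i,j) is minimal (among all pairs of leaves), then i and j are neighboring leaves in the tree. Here D is not the raw distance but the corrected distance $D_{ij} = d_{ij} - (r_i + r_j)$, where $r_i$ is the average distance from leaf i to the other leaves as defined in the neighbor-joining step.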

25 Neighbor Joining Algorithm In 1987 Naruya Saitou and Masatoshi Nei developed the neighbor joining algorithm for phylogenetic tree reconstruction. It finds a pair of leaves that are close to each other but far from other leaves, and so implicitly finds a pair of neighboring leaves. Advantages: it works well for additive matrices and for many non-additive ones, and it does not rely on the flawed molecular clock assumption.

26 Neighbor-joining Guaranteed to produce the correct tree if distance is additive. May produce a good tree even when distance is not additive. Step 1: Finding neighboring leaves. Define
$D_{ij} = d_{ij} - (r_i + r_j)$
where
$r_i = \frac{1}{|L| - 2} \sum_{k \in L} d_{ik}$
[Figure: four-leaf tree with leaves 1 to 4 and branch lengths 0.1 and 0.4.]
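To see why the correction matters, here is a small sketch on a hypothetical four-leaf tree: leaves 1 and 2 hang off one internal node with edge lengths 1 and 4, leaves 3 and 4 hang off the other with edge lengths 1 and 4, and the internal edge has length 1. These integer lengths and the helper name are assumptions for illustration, not values taken from the slide's figure.

```python
def corrected_distances(d, leaves):
    """Return (r, D) with r_i = sum_k d_ik / (|L| - 2) and D_ij = d_ij - (r_i + r_j)."""
    n = len(leaves)
    r = {i: sum(d[i][k] for k in leaves if k != i) / (n - 2) for i in leaves}
    D = {
        (i, j): d[i][j] - (r[i] + r[j])
        for idx, i in enumerate(leaves)
        for j in leaves[idx + 1:]
    }
    return r, D


leaves = [1, 2, 3, 4]
pairs = {(1, 2): 5, (1, 3): 3, (1, 4): 6, (2, 3): 6, (2, 4): 9, (3, 4): 5}
d = {i: {} for i in leaves}
for (i, j), value in pairs.items():
    d[i][j] = d[j][i] = value

r, D = corrected_distances(d, leaves)
closest_raw = min(pairs, key=pairs.get)  # pair with the smallest raw distance
closest_corrected = min(D, key=D.get)    # pair with the smallest corrected distance
print(closest_raw, closest_corrected)
```

The smallest raw distance belongs to leaves 1 and 3, which are not neighbors, while the corrected distance is minimized by a true neighbor pair.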

27 Algorithm: Neighbor-joining
Initialization:
◦ Define T to be the set of leaf nodes, one per sequence
◦ Let L = T
Iteration:
◦ Pick i, j such that $D_{ij}$ is minimal
◦ Define a new node k, and set $d_{km} = \frac{1}{2}(d_{im} + d_{jm} - d_{ij})$ for all $m \in L$
◦ Add k to T, with edges of lengths $d_{ik} = \frac{1}{2}(d_{ij} + r_i - r_j)$ and $d_{jk} = d_{ij} - d_{ik}$
◦ Remove i, j from L; add k to L
Termination:
◦ When L consists of two nodes i and j, add the edge between them with length $d_{ij}$
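The loop above can be sketched in a few dozen lines. This is an unoptimized illustration, not a reference implementation: the nested-dict input, the generated internal node names (`node0`, `node1`, ...), and the four-leaf test distances are all assumptions made here.

```python
def neighbor_joining(dist):
    """Return unrooted tree edges (a, b, length) from a nested dict of distances."""
    d = {a: dict(row) for a, row in dist.items()}  # work on a copy
    active = list(d)                               # the set L
    edges = []
    next_id = 0
    while len(active) > 2:
        n = len(active)
        r = {a: sum(d[a][b] for b in active if b != a) / (n - 2) for a in active}
        # Pick the pair minimizing the corrected distance d_ij - (r_i + r_j).
        i, j = min(
            ((a, b) for idx, a in enumerate(active) for b in active[idx + 1:]),
            key=lambda p: d[p[0]][p[1]] - r[p[0]] - r[p[1]],
        )
        k = f"node{next_id}"
        next_id += 1
        d_ik = 0.5 * (d[i][j] + r[i] - r[j])
        d_jk = d[i][j] - d_ik
        edges.append((i, k, d_ik))
        edges.append((j, k, d_jk))
        # Distances from the new node k to every remaining node m.
        d[k] = {}
        for m in active:
            if m not in (i, j):
                d_km = 0.5 * (d[i][m] + d[j][m] - d[i][j])
                d[k][m] = d_km
                d[m][k] = d_km
        active = [m for m in active if m not in (i, j)] + [k]
    i, j = active
    edges.append((i, j, d[i][j]))  # termination: join the last two nodes
    return edges


pairs = {("1", "2"): 5, ("1", "3"): 3, ("1", "4"): 6,
         ("2", "3"): 6, ("2", "4"): 9, ("3", "4"): 5}
dist = {name: {} for name in "1234"}
for (a, b), value in pairs.items():
    dist[a][b] = dist[b][a] = value

edges = neighbor_joining(dist)
for edge in edges:
    print(edge)
```

On this additive input the sketch recovers leaf edges of length 1, 4, 1, 4 and an internal edge of length 1, matching the hypothetical tree the distances were read from.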

28 Rooting a tree, and definition of outgroup Neighbor-joining produces an unrooted tree. How do we root a tree of N species built with neighbor-joining? An outgroup is a species that we know to be more distantly related to all remaining species than they are to one another. Example: Human, mouse, rat, pig, dog, chicken, whale. Which one is an outgroup? The branch leading to the outgroup can serve as the position of the root. [Figure: unrooted tree with leaves 1 to 4.]

29 Neighbor Joining Algorithm: Widely Used Applicable to matrices which are not additive. Known to work well in practice. The algorithm and its variants are the most widely used distance-based algorithms today.

30 Maximum Parsimony Method for Tree Inference A character-based method. Input: h sequences (one per species), all of length k. Goal: Find a tree with the input sequences at its leaves, and an assignment of sequences to internal nodes, such that the total number of substitutions is minimized. Two sub-problems:
1. Find the parsimony cost of a given tree (easy)
2. Search through all tree topologies (hard)

31 Example Input: four nucleotide sequences AAG, AAA, GGA, AGA taken from four species. By the parsimony principle, we seek a tree that has a minimum total number of substitutions of symbols between species and their originator in the phylogenetic tree. Here is one possible tree. [Figure: tree with node labels AGA, AAA, GGA, AAG, AAA and edge labels 2, 1, 1.] Total #substitutions = 4
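The "easy" sub-problem, scoring a fixed tree, is commonly solved with Fitch's algorithm, which the slides do not name; the sketch below uses it as one possible approach. The leaf names `s1` to `s4` and both nested-tuple topologies are hypothetical and are not claimed to match the tree in the slide's figure.

```python
def fitch_site(node, site, sequences):
    """Return (candidate state set, substitution count) for one alignment column."""
    if isinstance(node, str):  # leaf: its observed character, no cost
        return {sequences[node][site]}, 0
    (left_set, left_cost), (right_set, right_cost) = (
        fitch_site(child, site, sequences) for child in node
    )
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # Disjoint child sets force one substitution at this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_cost(tree, sequences):
    """Minimum number of substitutions over all sites for a fixed topology."""
    length = len(next(iter(sequences.values())))
    return sum(fitch_site(tree, site, sequences)[1] for site in range(length))


sequences = {"s1": "AAG", "s2": "AAA", "s3": "GGA", "s4": "AGA"}
tree_a = (("s1", "s2"), ("s3", "s4"))
tree_b = (("s1", "s3"), ("s2", "s4"))
print(parsimony_cost(tree_a, sequences), parsimony_cost(tree_b, sequences))
```

Under these assumptions the first grouping needs 3 substitutions and the second needs 4, so the first is the more parsimonious of the two.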

32 Least Squares Distance Phylogeny Problem If the distance matrix D is NOT additive, then we look for a tree T that approximates D the best:
Squared Error: $\sum_{i,j} (d_{ij}(T) - D_{ij})^2$
Squared Error is a measure of the quality of the fit between distance matrix and the tree: we want to minimize it. Least Squares Distance Phylogeny Problem: finding the best approximation tree T for a non-additive matrix D (NP-hard).
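Evaluating the objective for one candidate tree is straightforward once the tree's leaf-to-leaf path lengths are known. In this sketch each unordered pair is counted once, and both matrices are invented for illustration: `tree_dist` holds path lengths in a hypothetical tree and `D` is the same matrix with one pair of entries changed from 9 to 11.

```python
def squared_error(tree_dist, D):
    """Sum of (d_ij(T) - D_ij)^2 over all unordered leaf pairs i < j."""
    n = len(D)
    return sum(
        (tree_dist[i][j] - D[i][j]) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )


tree_dist = [
    [0, 3, 8, 9],
    [3, 0, 9, 10],
    [8, 9, 0, 9],
    [9, 10, 9, 0],
]
D = [row[:] for row in tree_dist]
D[0][3] = D[3][0] = 11  # observed distance that the tree does not match

print(squared_error(tree_dist, D))
```

The hard part of the problem is not this evaluation but the search over tree topologies and edge lengths that minimizes it.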

33 Search through tree topologies: Branch and Bound Observation: adding an edge to an existing tree can only increase the parsimony cost. Enumerate all unrooted trees with at most n leaves: $[i_3][i_5][i_7] \ldots [i_{2N-5}]$, where each $i_k$ can take values from 0 (no edge) to k. At each point keep C = smallest cost so far for a complete tree. Start B&B with tree [1][0][0]……[0]. Whenever cost of current tree T is > C, then: ◦ T is not optimal ◦ Any tree with more edges containing T is not optimal: increment by 1 the rightmost nonzero counter

34 Comparison of Methods
◦ Neighbor-joining: very fast. Easily trapped in local optima. Good for generating a tentative tree, or choosing among multiple trees.
◦ Maximum parsimony: slow. Assumptions fail when evolution is rapid. Best option when tractable (<30 taxa, strong conservation).
◦ Maximum likelihood: very slow. Highly dependent on assumed evolution model. Good for very small data sets and for testing trees built using other methods.

35 Summary Category of phylogenetic inference algorithms Neighbor-joining algorithm

36 Acknowledgement Anonymous authors

