MSA- multiple sequence alignment Aligning many sequences is often preferable to pairwise comparisons. Problem- Computational complexity of multiple alignments.

1 MSA- multiple sequence alignment Aligning many sequences is often preferable to pairwise comparisons. Problem- Computational complexity of multiple alignments grows rapidly with the number of sequences being aligned.

2 “Even using supercomputers or networks of workstations, multiple sequence alignment is an intractable problem for more than 20 or so sequences of average length and complexity.”

3 As a result, alignment methods using heuristics have been developed. These methods (including ClustalW) cannot guarantee an optimal alignment, but can find near-optimal alignments for larger numbers of sequences.

4 CLUSTALW The Clustal family dates from 1988 (ClustalW itself was released in 1994). Begins by aligning closely related sequences and then adds increasingly divergent sequences to produce a complete MSA.

5 http://www.ncbi.nlm.nih.gov/ http://www.ebi.ac.uk/clustalw/

6 Introduction to Molecular Phylogeny* *Phylogeny- the evolutionary history of a group

7 Mutations Happen! 3 types possible: Deleterious Advantageous ???

8 Important Point: Much of the variation that is observed among individuals must have little beneficial or detrimental effect and be essentially selectively neutral. Deleterious mutations are screened out. Advantageous mutations are rare.

9 Functional Constraints? Portions of genes that are especially important are said to be under functional constraint and tend to accumulate changes very slowly. Ex. = histone proteins- practically every amino acid is important. A yeast histone can replace a human histone.

10 Relative Rate of Change within  -globin gene (4 mammals)

11 Basis of Molecular Phylogenetics The evolution of species can be modeled as a bifurcating process- speciation is initiated when two populations become reproductively isolated.

12 Basis of Molecular Phylogenetics Once these two populations cease to interbreed, it is inevitable that they diverge due to random mutational processes.

13 Basis of Molecular Phylogenetics Over time, this branching process may repeat itself. A species is said to be related to some other species with which it shares a direct common ancestor.

14

15 Basis of Molecular Phylogenetics The amount of DNA sequence difference between a pair of organisms should indicate how recently those two organisms shared a common ancestor.

16

17 Basis of Molecular Phylogenetics The longer two populations remain reproductively isolated, the more DNA divergence will occur. The longer two populations remain reproductively isolated, the more protein divergence will occur.

18 Molecular Phylogeny is relatively new. Evolution by Natural Selection- Darwin/Wallace 1858 Molecular Phylogeny 1960s ??

19 How it started.... In 1959, scientists determined the three-dimensional structures of two proteins that are found in almost every animal: hemoglobin and myoglobin. During the next two decades, myoglobin and hemoglobin sequences were determined for dozens of mammals, birds, reptiles, amphibians, fish, etc.

20 What they found... “This tree agreed completely with observations derived from paleontology and anatomy about the common descent of the corresponding organisms.”* * from Science and Creationism: A View from the National Academy of Sciences, 2nd Ed., 1999.

21 Organisms with high degrees of molecular similarity are expected to be more closely related than those that are dissimilar.

22 Advantages of Molecular Phylogeny Can be used to decipher relationships between all living things. Relying on anatomy can be misleading- Similar traits can evolve in organisms that are not closely related (e.g. convergent evolution led to eyes in vertebrates, insects, and molluscs).

23 Word of Caution Phylogenetic analysis is controversial. There are a wide variety of different methods for analyzing the data, and even the experts often disagree on the best method for analyzing the data.

24 Why so controversial?? 2 Reasons:

25 #1 - Molecular vs. Classical How much weight should be given to molecular phylogenetic data when it conflicts with the findings of the traditional taxonomist??

26 ... The phylogeny of whales :

27

28 How many cars changed spaces during this 2-hour interval? Parking lot “A” at 2:00; Parking lot “A” at 4:00

29 #2- Molecular Phylogeny requires statistical estimations. Parking lot “A” at 2:00; Parking lot “A” at 4:00

30 Phylogenetic Data Analysis requires 4 steps: 1) Alignment; 2) Determine the substitution model; 3) Tree Building; 4) Tree Evaluation

31 STEP 1- Alignment Molecular phylogenetic analysis is dependent on a good alignment. An evolutionary tree based on an improper alignment is an erroneous tree.

32

33 Homology It is critical to phylogenetic analysis that homologous characters be compared across species. Webster’s New Collegiate- Fundamental similarity of structure due to descent from a common ancestral form.

34

35

36 Compare homologous genes and homologous characters: For DNA and proteins, this means that gaps must be placed correctly in multiple alignments to ensure that the same position is being compared for each species.

37 Homologous Genes? When could you accidentally compare nonhomologous genes? Be careful if you are comparing genes that are members of a gene family. Comparing a tubulin-3 from one species with a tubulin-6 from another will not generate accurate results.

38 What to align? Phylogenetic trees are generated by comparing DNA or protein. The molecule of choice depends on the question you are attempting to answer.

39 DNA contains more evolutionary information than protein:
   ATT GCG AAA CAC
     *   *   *  *
   ATA GCC AAG CTC

40 Protein (same region analyzed: only 1 difference)
   Ile-Ala-Lys-His
                *
   Ile-Ala-Lys-Leu
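
The comparison on slides 39 and 40 can be checked with a short script. This is only a sketch: the codon table is deliberately partial (just the eight codons in the example, taken from the standard genetic code), and the helper names `codons` and `count_differences` are invented for illustration.

```python
# Partial codon table: only the codons that appear in the slide's example.
CODON_TO_AA = {
    "ATT": "Ile", "ATA": "Ile",
    "GCG": "Ala", "GCC": "Ala",
    "AAA": "Lys", "AAG": "Lys",
    "CAC": "His", "CTC": "Leu",
}


def codons(seq):
    """Split a DNA string into consecutive 3-base codons."""
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def count_differences(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


dna_1 = "ATTGCGAAACAC"
dna_2 = "ATAGCCAAGCTC"

protein_1 = [CODON_TO_AA[c] for c in codons(dna_1)]
protein_2 = [CODON_TO_AA[c] for c in codons(dna_2)]

dna_diffs = count_differences(dna_1, dna_2)              # nucleotide level
protein_diffs = count_differences(protein_1, protein_2)  # amino acid level
print(dna_diffs, protein_diffs)
```

Three of the four nucleotide changes fall in third codon positions and are silent, which is why only one amino acid difference survives translation.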

41 DNA: its high rate of base substitution makes DNA best for very short-term studies, e.g. closely related species.

42

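43 * Homoplasy Return of a character to its original state, thus masking intervening mutational events. Because DNA has only four possible bases, repeated substitutions at a site often restore an earlier state; as a rule of thumb, every fourth mutation should result in a homoplasy.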

44 Protein: more reliable alignment than DNA; fewer homoplasies than DNA; lower rate of substitution than DNA; better for wide species comparisons.

45

46 rRNA= ribosomal RNA Best for very long term evolutionary studies spanning biological kingdoms Selective processes constraining sequence evolution should be roughly the same across species boundaries

47 STEP 2- Determine the substitution model.

48 A nucleotide substitution rate matrix:
        A    T    C    G
   A    5   -4
   T         5
   C              5
   G                   5

49 Step 3- Tree Building

50 Tree terminology: Nodes: branching points; Branches: lines; Topology: branching pattern

51 Branches can be rotated at a node, without changing the relationships.

52

53 Unrooted trees explain phylogenetic relationships; they say nothing about the directions of evolution- the order of descent

54

55

56 There are two main tree drawing methods. - Character Methods - Distance Methods Both approaches are widely used and work well with most data sets.

57 Distance methods Distance- a measure of the overall pairwise difference between two data sets. The raw material for tree reconstruction is tabular summaries of the pairwise differences between all data sets to be analyzed

58 In distance methods, the first step is to calculate a matrix of all pairwise differences between a set of sequences.
   Species   A    B    C    D
   B         9    -    -    -
   C         8   11    -    -
   D        12   15   10    -
   E        15   18   13    5
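
As a rough illustration of how such a matrix is filled in, the sketch below counts differing columns for every pair of aligned sequences. The toy alignment and the function name `pairwise_differences` are invented for this example, and gap handling and substitution-model corrections are deliberately left out.

```python
from itertools import combinations


def pairwise_differences(aligned):
    """Return {(name_i, name_j): number of differing columns} for every pair."""
    lengths = {len(s) for s in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return {
        (a, b): sum(x != y for x, y in zip(aligned[a], aligned[b]))
        for a, b in combinations(aligned, 2)
    }


# Hypothetical toy alignment, invented for illustration only.
toy = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTTC",
    "C": "ACGAACGTTC",
    "D": "TCGAACCTTC",
}

distances = pairwise_differences(toy)
for pair, d in distances.items():
    print(pair, d)
```

Raw counts like these underestimate the true number of changes once sequences have diverged enough for sites to change more than once, which is one reason a substitution model is chosen in Step 2.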

59 Distance methods Sequence pairs that have the smallest number of sequence changes between them are identified as ‘neighbors’. On a tree, these sequences share a common ancestor and are joined by a short branch.

60 UPGMA, pairwise distance and neighbor joining are distance methods. They progressively group sequences, starting with those that are most alike. UPGMA = unweighted-pair-group method with arithmetic mean

61 Phylogenetic trees based on distance methods. 1) The two sequences that are closest together are connected at a node. 2) The process is repeated until all sequences are joined. 3) Addition of the last sequence defines the root of the tree.
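
The three steps above can be sketched as a minimal UPGMA implementation, run here on the five-species distance matrix from slide 58. The function name `upgma`, the nested-tuple tree representation, and the `frozenset` keys are choices made for this illustration, not part of the original slides; branch lengths are omitted to keep the sketch short.

```python
def upgma(names, dist):
    """Cluster with UPGMA and return the tree as nested tuples.

    dist maps frozenset({x, y}) -> distance for every pair of names.
    """
    # Each cluster label maps to (subtree, number of leaves in it).
    clusters = {name: (name, 1) for name in names}
    d = dict(dist)
    while len(clusters) > 1:
        # Step 1: the two closest clusters are joined at a node.
        a, b = sorted(min(d, key=d.get))
        (tree_a, n_a), (tree_b, n_b) = clusters.pop(a), clusters.pop(b)
        merged = f"({a},{b})"
        # Distance from the new cluster to every other cluster is the
        # arithmetic mean over all leaf pairs (size-weighted average).
        for other in clusters:
            d[frozenset({merged, other})] = (
                n_a * d[frozenset({a, other})] + n_b * d[frozenset({b, other})]
            ) / (n_a + n_b)
        # Drop distances that involve the two clusters just merged.
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        clusters[merged] = ((tree_a, tree_b), n_a + n_b)
        # Step 2: repeat until everything is joined; the last join is the root.
    tree, _ = next(iter(clusters.values()))
    return tree


# Pairwise differences from the matrix on slide 58.
raw = {
    ("A", "B"): 9, ("A", "C"): 8, ("B", "C"): 11,
    ("A", "D"): 12, ("B", "D"): 15, ("C", "D"): 10,
    ("A", "E"): 15, ("B", "E"): 18, ("C", "E"): 13, ("D", "E"): 5,
}
dist = {frozenset(pair): value for pair, value in raw.items()}

tree = upgma(["A", "B", "C", "D", "E"], dist)
print(tree)
```

D and E (5 differences) are joined first, then A and C (8), then B joins the A/C cluster, and the final join of the two remaining clusters places the root. UPGMA implicitly assumes a roughly constant rate of change along all lineages, so the root it reports is only as good as that assumption.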

62 The branch lengths may reflect the degree of similarity (and theoretically reflect evolutionary time). Scaled trees- when branch lengths are proportional to the differences between base pairs. In the best of cases, scaled trees are additive (the physical length of branches connecting any two nodes is an accurate representation of their accumulated differences).

63

64 Phylogenetic trees based on distance methods. Relatively simple. Problem: –May not be accurate!!

65 Character Methods “There is no denying that distance-based methods ‘look at the big picture’ and pointedly ignore much potentially valuable information.”

66 Character Methods Analyses of individual characters are translated into evolutionary trees. Character- a well-defined feature that can exist in a limited number of different states. (Ex. DNA and protein sequences)

67 The concept of parsimony is at the heart of all character-based methods of phylogenetic reconstruction. Parsimony is the process of attaching preference to one evolutionary pathway over another on the basis of which pathway requires the invocation of the smallest number of mutational events.

68 Character-based methods of phylogenetic reconstruction. “The relationship that requires the fewest number of mutations to explain the current state of affairs is most likely to be correct”

69 First Step in Character Methods: Identify all of the informative sites:
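
A minimal sketch of this first step, using the usual definition of a parsimony-informative site: a column in which at least two different states each occur in at least two sequences. The four-sequence alignment and the function names are invented for illustration.

```python
from collections import Counter


def is_informative(column):
    """True if at least two states each occur in at least two sequences."""
    counts = Counter(column)
    return sum(1 for n in counts.values() if n >= 2) >= 2


def informative_sites(sequences):
    """Return 0-based indices of the informative columns of an alignment."""
    return [i for i, col in enumerate(zip(*sequences)) if is_informative(col)]


# Hypothetical four-sequence alignment, invented for illustration.
alignment = [
    "GGGAGC",
    "GGACGC",
    "GCAATC",
    "GCACTT",
]

sites = informative_sites(alignment)
print(sites)
```

Invariant columns and columns where only one sequence differs are skipped, because every possible tree explains them with the same number of changes.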

70 2nd step: Calculate the minimum number of substitutions at each informative site:
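
One common way to carry out this second step is Fitch's algorithm, sketched below for a single site; the slides do not say which procedure they use, so treat this as an assumption. The tree encoding (nested pairs of leaf names), the leaf names `s1` to `s4`, and the example states are all invented for illustration.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one site.

    tree is a leaf name (str) or a pair (left, right);
    states maps each leaf name to its observed base.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # The children disagree: one substitution is required on this branch pair.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical informative site: two sequences carry G, two carry C.
site = {"s1": "G", "s2": "G", "s3": "C", "s4": "C"}

candidate_trees = [
    (("s1", "s2"), ("s3", "s4")),
    (("s1", "s3"), ("s2", "s4")),
    (("s1", "s4"), ("s2", "s3")),
]
costs = [fitch(t, site)[1] for t in candidate_trees]
print(costs)
```

The tree that groups the two G sequences together needs one change at this site, while the other two groupings need two; summing such counts over all informative sites gives each tree's parsimony score.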

71 Final step: After sequences are aligned, algorithms model each tree.

72 Maximum parsimony is a character method Character methods require a multiple sequence alignment. Analysis of informative ‘characters’ is used to construct an evolutionary tree.

73 Maximum Parsimony: General scientific criterion for choosing among competing hypotheses states that we should accept the hypothesis that explains the data most simply and efficiently. The tree requiring the _______ number of nucleic acid or amino acid substitutions is selected.

74 Maximum Parsimony: The algorithm searches for a tree that requires the smallest number of changes to explain the differences observed among the groups under study.

75 Character methods are best suited for: sequences that are quite similar; small numbers of sequences. The method is computationally time consuming as all possible trees are examined.

76 Phylogenetic trees based on maximum likelihood: The aim is to find the tree (among all possible trees) that has the highest likelihood of producing the observed data (statistical methods).

77 Phylogenetic trees based on maximum likelihood are similar to maximum parsimony methods but also take into account the likelihood of specific mutations (ex. A → G).

78 Mutation Rates Vary: Transitions (purine to purine or pyrimidine to pyrimidine) occur more frequently than transversions (purine to pyrimidine or pyrimidine to purine).
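
The distinction can be made concrete with a small counter. This sketch reuses the two DNA sequences from slide 39; the function names are invented here, and any column containing a character other than A, C, G, or T (a gap, for example) is simply skipped.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(x, y):
    """Return 'transition', 'transversion', or None for a pair of bases."""
    bases = PURINES | PYRIMIDINES
    if x == y or x not in bases or y not in bases:
        return None
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_ts_tv(a, b):
    """Count transitions and transversions between two aligned sequences."""
    kinds = [classify(x, y) for x, y in zip(a, b)]
    return kinds.count("transition"), kinds.count("transversion")


transitions, transversions = count_ts_tv("ATTGCGAAACAC", "ATAGCCAAGCTC")
print(transitions, transversions)
```

A likelihood model can then weight the two kinds of change differently instead of treating every substitution as equally probable.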

79 Many of the methods described require significant amounts of computer time. Why?

80 Number of possible rooted and unrooted trees
   # of Data Sets   # of Rooted Trees                   # of Unrooted Trees
   2                1                                   1
   3                3                                   1
   4                15                                  3
   5                105                                 15
   10               34,459,425                          2,027,025
   15               213,458,046,676,875                 7,905,853,580,625
   20               8,200,794,532,637,891,559,375       221,643,095,476,699,771,875
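
These counts follow the standard formulas for bifurcating trees: (2n-3)!! rooted trees and (2n-5)!! unrooted trees for n data sets, where k!! is the product of the odd numbers up to k. The sketch below reproduces them; the function names are invented for this example.

```python
def odd_double_factorial(k):
    """Product k * (k - 2) * ... * 3 * 1 for odd k; 1 when k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 data sets: (2n-3)!!."""
    return odd_double_factorial(2 * n - 3)


def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 2 data sets: (2n-5)!!."""
    return odd_double_factorial(2 * n - 5)


for n in (2, 3, 4, 5, 10, 15, 20):
    print(n, f"{rooted_trees(n):,}", f"{unrooted_trees(n):,}")
```

Each added data set multiplies the count by the next odd number, so the growth is faster than exponential, which is why exhaustive searches stop being feasible at modest numbers of sequences.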

81

82 Programs take shortcuts. When a large number of trees is being compared, it is impossible to score each tree. A shortcut algorithm establishes an upper limit. As it evaluates other trees, it throws out any tree exceeding the upper bound before the calculation is completed.

83 Here are 194 of the phylogeny packages, and 16 free servers, that I know about. Updates to these pages are made about twice a year.

84 Tree Evaluation Every ‘tree drawing program’ will generate a tree. The important question is whether or not the tree drawn is the right one. In some cases, there are many trees of similar probabilities.

85 Vertebrate  -globins:

86

87 Bootstrap method of assessing tree reliability: Inferred tree is constructed from data set. Re-run the calculation on resampled versions of the data (alignment columns drawn at random with replacement). Resampling is repeated several (100-1000) times.

88

89 Bootstrap method Bootstrap trees are constructed from the resampled data sets. Each bootstrap tree is compared to the original inferred tree. The % of bootstrap trees supporting a node is determined for each node in the tree.
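
The resampling and scoring steps can be sketched as follows. Tree building itself is left out, so `replicate_clades` below is filled with made-up results purely to show how the support percentage is computed; the alignment, the function names, and the representation of a tree as a set of clades (each a `frozenset` of leaf names) are all assumptions of this example.

```python
import random


def bootstrap_alignment(sequences, rng):
    """Resample alignment columns with replacement, keeping the same length."""
    n_cols = len(sequences[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[i] for i in picks) for seq in sequences]


def support(clade, replicate_clades):
    """Percentage of bootstrap trees that contain the given clade."""
    hits = sum(clade in clades for clades in replicate_clades)
    return 100.0 * hits / len(replicate_clades)


# Hypothetical alignment, invented for illustration.
alignment = ["GGGAGC", "GGACGC", "GCAATC", "GCACTT"]
replicate = bootstrap_alignment(alignment, random.Random(0))

# Made-up clade sets standing in for four bootstrap trees.
ab = frozenset({"A", "B"})
replicate_clades = [{ab}, set(), {ab}, {ab}]
print(replicate, support(ab, replicate_clades))
```

In a real analysis each replicate alignment would be passed to the same tree-building method used for the original tree, and the clades of the resulting tree would take the place of the made-up sets.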

90 Molecular Clock Addition of time to phylogenetic tree. Units of time are often in millions of years. Assumption- substitution rates are constant over millions of years.

91 Molecular Clock Rates of molecular evolution for genes with similar functional constraints can be quite uniform. (Clock may run at different rates in different proteins.)

92 The End

93 Evolutionary biology also has benefited greatly from genome-sequencing projects. The wealth of new genome data is helping to better resolve the tree of life, particularly its major branches. This has been especially true for prokaryotes, where more than 80 genomes have been sequenced so far and the results have greatly improved our view of the early history of life.

94 Problem- As the # of sequences increases, the # of possible trees increases dramatically
   # of sequences   # of trees
   3                1
   4                3
   5                15
   6                105
   7                945
   8                10,395
   9                135,135
   10               2,027,025
   50               2.8 x 10^74

95 Phylogenetic trees based on neighbor joining. Also utilizes a ‘distance matrix’ Neighbor joining algorithm searches for sets of neighbors that minimize the total length of the tree. Can produce reasonable trees, especially when evolutionary distances are short.

96 For vertebrates, many thorny issues remain to be resolved, such as the phylogeny of families and other major groups in the tree of life. For example, it is not yet known whether humans are closer to mice or to cattle because different results have been obtained with different gene analyses. On the other hand, there is no guarantee that complete genome sequences will immediately solve all phylogenetic questions, as evidenced by the continuing debate over the relationships among humans, flies, and nematodes. We will need to develop new statistical methods and bioinformatics tools to handle the greater volume of data and to unravel the complexities of molecular evolution.

97 Today: The examination of molecular structure offers an extremely powerful tool for studying evolutionary relationships. The quantity of information is huge--as large as the thousands of different proteins contained in living organisms, and limited only by the time and resources of molecular biologists.

98 Choice of individual genes or proteins.

99 Determine the substitution model May be an amino acid substitution rate matrix such as PAM or BLOSUM. ADD DEMO.

100 Maximum parsimony and maximum likelihood are character methods Character methods attempt to reconstruct ancestral nodes of trees in order to fit the tree to an evolutionary model. They therefore use more of the information in the data, at the expense of longer execution time.

101 Distance matrices: Scoring matrices include values for all possible substitutions. Each mismatch between two sequences adds to the distance, and each identity subtracts from the distance.

