

MSA- multiple sequence alignment Aligning many sequences is often preferable to pairwise comparisons. Problem- Computational complexity of multiple alignments grows rapidly with the number of sequences being aligned.
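To get a feel for that growth, here is a rough sketch. It assumes the exact method is a dynamic-programming alignment that fills one cell for every combination of positions across the sequences; the slides do not name the algorithm, and the function name `dp_cells` is only illustrative.

```python
def dp_cells(num_sequences: int, length: int) -> int:
    """Cells in a full dynamic-programming lattice for equal-length sequences.

    Assumption: one axis per sequence, with length + 1 positions on each axis.
    """
    return (length + 1) ** num_sequences


# The cell count is exponential in the number of sequences.
for n in (2, 3, 5, 10):
    print(n, dp_cells(n, 100))
```

Two sequences of length 100 need about ten thousand cells; ten such sequences need more than 10^20, which is why heuristics take over.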

“Even using supercomputers or networks of workstations, multiple sequence alignment is an intractable problem for more than 20 or so sequences of average length and complexity.”

As a result, alignment methods using heuristics have been developed. These methods (including ClustalW) cannot guarantee an optimal alignment, but can find near-optimal alignments for larger numbers of sequences.

CLUSTALW The original Clustal program was developed in 1988; the ClustalW version followed in 1994. Begins by aligning closely related sequences and then adds increasingly divergent sequences to produce a complete MSA.

Introduction to Molecular Phylogeny* *Phylogeny- the evolutionary history of a group

Mutations Happen! 3 types possible: Deleterious Advantageous ???

Important Point: Much of the variation that is observed among individuals must have little beneficial or detrimental effect and be essentially selectively neutral. Deleterious mutations are screened out. Advantageous mutations are rare.

Functional Constraints? Portions of genes that are especially important are said to be under functional constraint and tend to accumulate changes very slowly. Ex. = histone proteins- practically every amino acid is important. A yeast histone can replace a human histone.

Relative Rate of Change within  -globin gene (4 mammals)

Basis of Molecular Phylogenetics The evolution of species can be modeled as a bifurcating process- speciation is initiated when two populations become reproductively isolated.

Basis of Molecular Phylogenetics Once these two populations cease to interbreed, it is inevitable that they diverge due to random mutational processes.

Basis of Molecular Phylogenetics Over time, this branching process may repeat itself. A species is said to be related to some other species with which it shares a direct common ancestor.

Basis of Molecular Phylogenetics The amount of DNA sequence difference between a pair of organisms should indicate how recently those two organisms shared a common ancestor.

Basis of Molecular Phylogenetics The longer two populations remain reproductively isolated, the more DNA divergence will occur. The longer two populations remain reproductively isolated, the more protein divergence will occur.

Molecular Phylogeny is relatively new. Evolution by Natural Selection- Darwin/Wallace 1858 Molecular Phylogeny 1960s ??

How it started.... In 1959, scientists determined the three-dimensional structures of two proteins that are found in almost every animal: hemoglobin and myoglobin. During the next two decades, myoglobin and hemoglobin sequences were determined for dozens of mammals, birds, reptiles, amphibians, fish, etc.

What they found... “This tree agreed completely with observations derived from paleontology and anatomy about the common descent of the corresponding organisms.”* * from Science and Creationism: A View from the National Academy of Sciences, 2nd Ed., 1999.

Organisms with high degrees of molecular similarity are expected to be more closely related than those that are dissimilar.

Advantages of Molecular Phylogeny Can be used to decipher relationships between all living things. Relying on anatomy can be misleading- Similar traits can evolve in organisms that are not closely related (e.g. convergent evolution led to eyes in vertebrates, insects, and molluscs).

Word of Caution Phylogenetic analysis is controversial. There are a wide variety of different methods for analyzing the data, and even the experts often disagree on the best method for analyzing the data.

Why so controversial?? 2 Reasons:

#1 - Molecular vs. Classical How much weight is given to molecular phylogenetic data when it contradicts the findings of the traditional taxonomist??

... The phylogeny of whales :

How many cars changed spaces during this 2-hour interval? [Figure: parking lot “A” at 2:00 and parking lot “A” at 4:00]

#2- Molecular Phylogeny requires statistical estimations. [Figure: parking lot “A” at 2:00 and parking lot “A” at 4:00]

Phylogenetic Data Analysis requires 4 steps
1) Alignment
2) Determine the substitution model
3) Tree Building
4) Tree Evaluation

STEP 1- Alignment Molecular phylogenetic analysis is dependent on a good alignment. An evolutionary tree based on an improper alignment is an erroneous tree.

Homology It is critical to phylogenetic analysis that homologous characters be compared across species. Webster’s New Collegiate- Fundamental similarity of structure due to descent from a common ancestral form.

Compare homologous genes and homologous characters: For DNA and proteins, this means that gaps must be placed correctly in multiple alignments to ensure that the same position is being compared for each species.

Homologous Genes? When could you accidentally compare nonhomologous genes? Be careful if you are comparing genes that are members of a gene family. Comparing a tubulin-3 from one species with a tubulin-6 from another will not generate accurate results.

What to align? Phylogenetic trees are generated by comparing DNA or protein. The molecule of choice depends on the question you are attempting to answer.

DNA contains more evolutionary information than protein:
ATT GCG AAA CAC
  *   *   *  *
ATA GCC AAG CTC

Protein (same region analyzed; only 1 difference):
Ile-Ala-Lys-His
Ile-Ala-Lys-Leu
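The comparison on these two slides can be checked with a short sketch. The two DNA strings and their translations come from the slides; the helper names and the deliberately partial codon table are illustrative only.

```python
# Only the codons that appear on the slide; a full codon table is not needed here.
CODON_TO_AA = {
    "ATT": "Ile", "ATA": "Ile",
    "GCG": "Ala", "GCC": "Ala",
    "AAA": "Lys", "AAG": "Lys",
    "CAC": "His", "CTC": "Leu",
}


def codons(dna):
    """Split a DNA string into consecutive triplets, ignoring spaces."""
    dna = dna.replace(" ", "")
    return [dna[i:i + 3] for i in range(0, len(dna), 3)]


def count_differences(a, b):
    """Number of aligned positions at which two equal-length sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


seq1 = "ATT GCG AAA CAC"
seq2 = "ATA GCC AAG CTC"

dna_diffs = count_differences(seq1.replace(" ", ""), seq2.replace(" ", ""))
protein1 = [CODON_TO_AA[c] for c in codons(seq1)]
protein2 = [CODON_TO_AA[c] for c in codons(seq2)]
protein_diffs = count_differences(protein1, protein2)

print(dna_diffs, protein_diffs)  # prints: 4 1
```

Three of the four DNA changes are silent third-position changes, so only the His/Leu difference survives translation.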

DNA The high rate of base substitution makes DNA best for very short-term studies, e.g. closely related species.

* Homoplasy Return of a character to its original state, thus masking intervening mutational events. Every fourth mutation should result in a homoplasy.

Protein More reliable alignment than DNA: fewer homoplasies than DNA; lower rate of substitution than DNA; better for wide species comparisons.

rRNA= ribosomal RNA Best for very long-term evolutionary studies spanning biological kingdoms. Selective processes constraining sequence evolution should be roughly the same across species boundaries.

STEP 2- Determine the substitution model.

A nucleotide substitution rate matrix:
     A    T    C    G
A    5   -4
T         5
C              5
G                   5
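A minimal sketch of how such a matrix is applied to an alignment. Only the values 5 and -4 survive in the transcript, so this assumes that every identical pair scores 5 and every mismatch scores -4; the function name `score` is illustrative.

```python
MATCH = 5      # value that survives on the slide's diagonal
MISMATCH = -4  # assumed to apply to every off-diagonal cell


def score(a, b):
    """Sum the per-column scores of two aligned, gap-free sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(MATCH if x == y else MISMATCH for x, y in zip(a, b))


# 8 identical columns and 4 mismatches: 8 * 5 + 4 * (-4)
print(score("ATTGCGAAACAC", "ATAGCCAAGCTC"))  # prints: 24
```

A model that treats transitions and transversions differently would replace the single `MISMATCH` value with a lookup keyed on the pair of bases.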

Step 3- Tree Building

Tree terminology:
Nodes: branching points
Branches: lines
Topology: branching pattern

Branches can be rotated at a node, without changing the relationships.
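One way to see this is to write a tree as nested tuples and compare trees only after putting the children of every node in a fixed order. This is a sketch; the nested-tuple representation and the name `canonical` are choices made for illustration.

```python
def canonical(tree):
    """Return a rotation-independent form of a nested-tuple tree.

    Leaves are strings; internal nodes are tuples of subtrees.
    """
    if isinstance(tree, str):
        return tree
    # Sorting the children removes any dependence on left/right drawing order.
    return tuple(sorted((canonical(child) for child in tree), key=repr))


drawn_one_way = (("A", "B"), "C")
drawn_rotated = ("C", ("B", "A"))
different_tree = (("A", "C"), "B")

print(canonical(drawn_one_way) == canonical(drawn_rotated))   # True
print(canonical(drawn_one_way) == canonical(different_tree))  # False
```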

Unrooted trees explain phylogenetic relationships; they say nothing about the directions of evolution- the order of descent

There are two main tree drawing methods.
- Character Methods
- Distance Methods
Both approaches are widely used and work well with most data sets.

Distance methods Distance- a measure of the overall pairwise difference between two data sets. The raw material for tree reconstruction is tabular summaries of the pairwise differences between all data sets to be analyzed

In distance methods, the first step is to calculate a matrix of all pairwise differences between a set of sequences.
Species  A  B  C  D
B
C
D
E
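A minimal sketch of that first step, counting raw differences between every pair of aligned sequences. The four sequences are made up for illustration and are not the data from the slide.

```python
from itertools import combinations


def count_differences(a, b):
    """Number of aligned positions at which two equal-length sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def distance_matrix(seqs):
    """Map each unordered pair of names to its pairwise difference count."""
    return {
        (n1, n2): count_differences(seqs[n1], seqs[n2])
        for n1, n2 in combinations(seqs, 2)
    }


# Hypothetical aligned sequences.
seqs = {
    "A": "ACGTACGT",
    "B": "ACGTACGA",
    "C": "ACGAACTA",
    "D": "TCGAACTT",
}

for pair, d in distance_matrix(seqs).items():
    print(pair, d)
```

Raw counts underestimate the true number of changes once sites have mutated more than once, which is where the substitution model from Step 2 comes in.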

Distance methods The sequence pairs that have the smallest number of sequence changes between them are identified as ‘neighbors’. On a tree, these sequences share a common ancestor and are joined by a short branch.

UPGMA, pairwise distance and neighbor joining are distance methods. They progressively group sequences, starting with those that are most alike. UPGMA = unweighted-pair-group method with arithmetic mean

Phylogenetic trees based on distance methods.
1) The two sequences that are closest together are connected at a node.
2) The process is repeated until all sequences are joined.
3) Addition of the last sequence defines the root of the tree.
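The three steps above can be sketched as a small UPGMA-style clustering loop. This is an illustrative simplification: it returns only the topology as nested tuples, ignores branch lengths, and uses made-up distances.

```python
def upgma(names, dist):
    """Cluster by repeatedly joining the closest pair (UPGMA averaging).

    names: list of leaf labels.
    dist:  dict mapping (name1, name2) tuples to distances.
    Returns the tree as nested tuples; the last join is the root.
    """
    size = {name: 1 for name in names}  # cluster -> number of leaves in it
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(size) > 1:
        # 1) The two closest clusters are connected at a node.
        closest = min(d, key=d.get)
        a, b = sorted(closest, key=repr)
        merged = (a, b)
        new_size = size[a] + size[b]
        # Distance from the new cluster to each other cluster: the mean over
        # all leaf pairs, obtained by weighting with the cluster sizes.
        for other in size:
            if other in (a, b):
                continue
            d_a = d.pop(frozenset((a, other)))
            d_b = d.pop(frozenset((b, other)))
            d[frozenset((merged, other))] = (size[a] * d_a + size[b] * d_b) / new_size
        del d[closest]
        del size[a], size[b]
        size[merged] = new_size
        # 2) Repeat until everything is joined.
    return next(iter(size))


# Hypothetical pairwise distances for four sequences.
dist = {
    ("A", "B"): 1, ("A", "C"): 3, ("A", "D"): 3,
    ("B", "C"): 2, ("B", "D"): 4, ("C", "D"): 2,
}
print(upgma(["A", "B", "C", "D"], dist))  # (('A', 'B'), ('C', 'D'))
```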

The branch lengths may reflect the degree of similarity (and theoretically reflect evolutionary time). Scaled trees- trees in which branch lengths are proportional to the number of differences between sequences. In the best of cases, scaled trees are additive (the physical length of branches connecting any two nodes is an accurate representation of their accumulated differences).

Phylogenetic trees based on distance methods. Relatively simple. Problem: –May not be accurate!!

Character Methods “There is no denying that distance- based methods “look at the big picture” and pointedly ignore much potentially valuable information.”

Character Methods Analysis of individual characters is translated into evolutionary trees. Character- a well-defined feature that can exist in a limited number of different states. (Ex. DNA and protein sequences)

The concept of parsimony is at the heart of all character-based methods of phylogenetic reconstruction. Parsimony is the process of attaching preference to one evolutionary pathway over another on the basis of which pathway requires the invocation of the smallest number of mutational events.

Character-based methods of phylogenetic reconstruction. “The relationship that requires the fewest number of mutations to explain the current state of affairs is most likely to be correct”

First Step in Character Methods: Identify all of the informative sites:
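A sketch of this step, using the usual definition of a parsimony-informative site: at least two different states each occur in at least two sequences. The four sequences are made up for illustration.

```python
from collections import Counter


def is_informative(column):
    """True if at least two states each appear in two or more sequences."""
    counts = Counter(column)
    return sum(1 for c in counts.values() if c >= 2) >= 2


def informative_sites(seqs):
    """Zero-based column indices of the informative sites in an alignment."""
    return [i for i, col in enumerate(zip(*seqs)) if is_informative(col)]


# Hypothetical aligned sequences.
seqs = [
    "AGGACT",
    "AGCATT",
    "AACTCT",
    "AATGTA",
]
print(informative_sites(seqs))  # [1, 4]
```

Invariant columns and columns where only one sequence differs are skipped, because they require the same number of changes on every tree.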

2nd step: Calculate the minimum number of substitutions at each informative site:
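One standard way to do this count is Fitch's algorithm; the slides do not say which algorithm was shown, so treat this as an illustrative sketch. Trees are written as nested tuples and the site pattern is made up.

```python
def fitch(tree, states):
    """Fitch's algorithm for one site on a binary tree.

    tree:   a leaf name (str) or a pair (left_subtree, right_subtree).
    states: dict mapping each leaf name to its state at this site.
    Returns (set of possible ancestral states, minimum substitutions).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra substitution here.
        return common, left_cost + right_cost
    # No shared state: one substitution is needed at this node.
    return left_set | right_set, left_cost + right_cost + 1


site = {"A": "G", "B": "G", "C": "A", "D": "A"}  # hypothetical informative site
for tree in ((("A", "B"), ("C", "D")),
             (("A", "C"), ("B", "D")),
             (("A", "D"), ("B", "C"))):
    print(tree, fitch(tree, site)[1])
```

For this site the first tree needs one substitution and the other two need two each, so the site favors grouping A with B.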

Final step: After sequences are aligned, algorithms model each tree.

Maximum parsimony is a character method. Character methods require a multiple sequence alignment. Analysis of informative ‘characters’ is used to construct an evolutionary tree.

Maximum Parsimony: General scientific criterion for choosing among competing hypotheses states that we should accept the hypothesis that explains the data most simply and efficiently. The tree requiring the _______ number of nucleic acid or amino acid substitutions is selected.

Maximum Parsimony: The algorithm searches for a tree that requires the smallest number of changes to explain the differences observed among the groups under study.

Character methods are best suited for... Sequences that are quite similar. Small numbers of sequences. The method is computationally time consuming as all possible trees are examined.

Phylogenetic trees based on maximum likelihood: The aim is to find the tree (among all possible trees) that has the highest likelihood of producing the observed data (statistical methods).

Phylogenetic trees based on maximum likelihood are similar to maximum parsimony methods but also take into account the likelihood of specific mutations (e.g. A → G).
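The idea of choosing whatever makes the observed data most probable can be shown on the smallest possible case: estimating the distance between two sequences. This sketch assumes the Jukes-Cantor model (all substitutions equally likely), which the slides do not name, and uses made-up site counts; a real program maximizes over whole trees, not one distance.

```python
import math


def log_likelihood(d, n_same, n_diff):
    """Log-likelihood of two aligned sequences separated by distance d.

    d is in expected substitutions per site under the Jukes-Cantor model.
    """
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e  # probability a site shows the same base
    p_diff = 0.25 - 0.25 * e  # probability of one specific different base
    n = n_same + n_diff
    return n * math.log(0.25) + n_same * math.log(p_same) + n_diff * math.log(p_diff)


n_same, n_diff = 90, 10  # hypothetical: 10 differences in 100 sites

# Try many candidate distances and keep the one with the highest likelihood.
grid = [i / 1000 for i in range(1, 1001)]
best_d = max(grid, key=lambda d: log_likelihood(d, n_same, n_diff))

# For this simple case the maximum is also known in closed form.
p = n_diff / (n_same + n_diff)
analytic_d = -0.75 * math.log(1 - 4 * p / 3)

print(best_d, round(analytic_d, 4))
```

The estimate is slightly larger than the raw 10% difference because the model allows for sites that changed more than once.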

Mutation Rates Vary: Transitions (purine to purine or pyrimidine to pyrimidine) occur more frequently than transversions (purine to pyrimidine or pyrimidine to purine).
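A short sketch that classifies each difference between two aligned DNA sequences. The sequences are the ones from the DNA versus protein slide; the function names are illustrative, and the code assumes uppercase A, C, G, T only.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(x, y):
    """Label a pair of aligned bases."""
    if x == y:
        return "identical"
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def count_changes(a, b):
    """Count transitions and transversions between two aligned sequences."""
    counts = {"transition": 0, "transversion": 0}
    for x, y in zip(a, b):
        kind = classify(x, y)
        if kind != "identical":
            counts[kind] += 1
    return counts


print(count_changes("ATTGCGAAACAC", "ATAGCCAAGCTC"))
# {'transition': 1, 'transversion': 3}
```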

Many of the methods described require significant amounts of computer time. Why?

Number of possible rooted and unrooted trees
# of Data Sets | # of Rooted Trees | # of Unrooted Trees
10 | 34,459,425 | 2,027,025
15 | 213,458,046,676,875 | 7,905,853,580,625
20 | 8,200,794,532,637,891,559,375 | 221,643,095,476,699,771,875
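For strictly bifurcating trees these counts follow a closed form: n labeled data sets give (2n-3)!! rooted trees and (2n-5)!! unrooted trees, where !! is the double factorial. A minimal sketch, with illustrative function names:

```python
def double_factorial(k):
    """k * (k - 2) * (k - 4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 labeled data sets."""
    return double_factorial(2 * n - 3)


def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 labeled data sets."""
    return double_factorial(2 * n - 5)


for n in (3, 4, 5, 10, 15, 20):
    print(n, rooted_trees(n), unrooted_trees(n))
```

Each extra data set multiplies the count by the next odd number, which is why exhaustive searches stop being practical so quickly.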

Programs take shortcuts. When a large number of trees are being compared, it is impossible to score each tree. A shortcut algorithm establishes an upper limit. As it evaluates other trees, it throws out any tree exceeding the upper bound before the calculation is completed.

Here are some 194 of the phylogeny packages, and 16 free servers, that I know about. Updates to these pages are made about twice a year.

Tree Evaluation Every ‘tree drawing program’ will generate a tree. The important question is whether or not the tree drawn is the right one. In some cases, there are many trees of similar probabilities.

Vertebrate  -globins:

Bootstrap method of assessing tree reliability: Inferred tree is constructed from data set. Re-run the calculation on resampled versions of the data (alignment sites drawn at random with replacement). Resampling is repeated several ( ) times.

Bootstrap method Bootstrap trees are constructed from the resampled data sets. Bootstrap tree is compared to original inferred tree. % of bootstrap trees supporting a node are determined for each node in the tree.
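The procedure on these two slides can be sketched as follows. Tree building itself is left to a caller-supplied function, here called `find_clades`, which is a hypothetical placeholder: it is assumed to take an alignment and return the groupings in its tree as a set of frozensets.

```python
import random
from collections import Counter


def resample_columns(seqs, rng):
    """Draw alignment columns with replacement, keeping the alignment width."""
    width = len(seqs[0])
    picks = [rng.randrange(width) for _ in range(width)]
    return ["".join(seq[i] for i in picks) for seq in seqs]


def bootstrap_support(seqs, find_clades, replicates, seed=0):
    """Percentage of bootstrap trees that contain each group of the inferred tree.

    find_clades is a hypothetical tree-building step supplied by the caller.
    """
    rng = random.Random(seed)
    original = find_clades(seqs)  # groups in the inferred tree
    hits = Counter()
    for _ in range(replicates):
        replicate_clades = find_clades(resample_columns(seqs, rng))
        hits.update(replicate_clades & original)  # groups that reappear
    return {clade: 100.0 * hits[clade] / replicates for clade in original}


# The resampled alignment keeps its shape; only the column mix changes.
example = ["ACGT", "ACGA", "TCGA"]
print(resample_columns(example, random.Random(1)))
```

Every sequence is resampled with the same column picks, so each resampled column is still a real column of the original alignment.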

Molecular Clock Addition of time to phylogenetic tree. Units of time are often in millions of years. Assumption- substitution rates are constant over millions of years.

Molecular Clock Rates of molecular evolution for genes with similar functional constraints can be quite uniform. (Clock may run at different rates in different proteins.)
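Under a constant-rate assumption, a distance can be converted to a time with the usual relation d = 2rT: both lineages accumulate changes, so the time since the split is T = d / (2r). The numbers below are hypothetical and serve only to show the arithmetic.

```python
def divergence_time(distance, rate):
    """Years since two lineages split, assuming a strict molecular clock.

    distance: substitutions per site separating the two sequences.
    rate:     substitutions per site per year along one lineage.
    """
    # Factor of 2: changes accumulate independently on both lineages.
    return distance / (2.0 * rate)


# Hypothetical values: 2% divergence at 1e-9 substitutions per site per year.
years = divergence_time(0.02, 1e-9)
print(years / 1e6, "million years")
```

Because the clock may run at different rates in different proteins, the rate has to be calibrated separately for each gene being used.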

The End

Evolutionary biology also has benefited greatly from genome-sequencing projects. The wealth of new genome data is helping to better resolve the tree of life, particularly its major branches. This has been especially true for prokaryotes, where more than 80 genomes have been sequenced so far and the results have greatly improved our view of the early history of life.

Problem- As the # of sequences increases, the # of possible trees increases dramatically. [Table: # of sequences vs. # of trees; most values were lost in extraction, and the last surviving entry is on the order of 10^74]

Phylogenetic trees based on neighbor joining. Also utilizes a ‘distance matrix’. The neighbor joining algorithm searches for sets of neighbors that minimize the total length of the tree. Can produce reasonable trees, especially when evolutionary distances are short.
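A sketch of the pair-selection step of neighbor joining only. In the standard formulation, each pair (i, j) is scored as Q(i, j) = (n - 2) d(i, j) - sum of row i - sum of row j, and the pair with the smallest Q is joined first. The distance matrix is made up, and the later steps (branch lengths and matrix updates) are omitted.

```python
def q_matrix(d):
    """Neighbor-joining Q values for a symmetric distance matrix (list of lists)."""
    n = len(d)
    totals = [sum(row) for row in d]
    return [
        [(n - 2) * d[i][j] - totals[i] - totals[j] if i != j else 0
         for j in range(n)]
        for i in range(n)
    ]


def closest_neighbors(d):
    """Indices of the pair with the smallest Q value."""
    q = q_matrix(d)
    n = len(d)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return min(pairs, key=lambda p: q[p[0]][p[1]])


# Hypothetical distances among five sequences (indices 0 to 4).
d = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(closest_neighbors(d))  # (0, 1)
```

Sequences 3 and 4 have the smallest raw distance, yet the pair (0, 1) is joined first: the criterion corrects for how far each sequence is from all the others, which is how the method differs from simply grouping the most similar pair.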

For vertebrates, many thorny issues remain to be resolved, such as the phylogeny of families and other major groups in the tree of life. For example, it is not yet known whether humans are closer to mice or to cattle because different results have been obtained with different gene analyses. On the other hand, there is no guarantee that complete genome sequences will immediately solve all phylogenetic questions, as evidenced by the continuing debate over the relationships among humans, flies, and nematodes. We will need to develop new statistical methods and bioinformatics tools to handle the greater volume of data and to unravel the complexities of molecular evolution.

Today: The examination of molecular structure offers an extremely powerful tool for studying evolutionary relationships. The quantity of information is huge--as large as the thousands of different proteins contained in living organisms, and limited only by the time and resources of molecular biologists.

Choice of individual genes or proteins.

Determine the substitution model May be an amino acid substitution matrix such as PAM or BLOSUM.

Maximum parsimony and maximum likelihood are character methods Character methods attempt to reconstruct ancestral nodes of trees in order to fit the tree to an evolutionary model. They therefore use more of the information in the data, at the expense of longer execution time.

Distance matrices: Scoring matrices include values for all possible substitutions. Each mismatch between two sequences adds to the distance, and each identity subtracts from the distance.