With the astonishing advance of the Human Genome Project, essentially all human genomic sequences are available in public databases. The major task for the entire scientific community is to identify medically important genes and determine their functions. Discovery and characterization of corin, the first transmembrane serine protease identified from the heart, exemplifies such a challenge. The bioinformatic and biochemical approaches used in our studies can be applied to study many other genes. Serine proteases are important for a variety of biological processes including food digestion, blood coagulation, host defense and embryonic development. These proteases are also protein targets of pharmaceutical drugs. For example, inhibitors of blood clotting enzymes such as thrombin and factor X are developed to prevent and treat thrombotic diseases.

To identify novel serine proteases in the cardiovascular system, we used the BLAST program to search genomic databases for new genes that share significant homology with serine protease family members, such as trypsin. A partial cDNA sequence (EST) was identified from a human heart library and subsequently used to clone the full-length cDNA of a novel gene, designated corin for its abundant expression in the heart.

Sequence analysis indicates that human corin cDNA encodes a polypeptide of 1042 amino acids. Near the amino terminus of corin, there is a transmembrane domain identified by hydropathy plots using the GCG program. In the extracellular region, corin contains two frizzled-like cysteine-rich motifs, seven low density lipoprotein receptor repeats, a macrophage scavenger receptor-like domain, and a trypsin-like protease domain. Such a unique mosaic domain structure had not previously been found in any member of the trypsin superfamily.
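A hydropathy plot of the kind mentioned above can be sketched in a few lines. This is only an illustration, not the GCG program: it assumes the Kyte-Doolittle hydropathy scale, and the window of 19 residues and threshold of 1.6 are commonly quoted heuristics chosen here as assumptions. The example sequence is a made-up toy peptide, not corin.

```python
# Kyte-Doolittle hydropathy scale (higher = more hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full window along the sequence."""
    scores = [KD[aa] for aa in seq.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def candidate_tm_windows(seq, window=19, threshold=1.6):
    """Start positions of windows hydrophobic enough to suggest a
    membrane-spanning segment (assumed heuristic threshold)."""
    profile = hydropathy_profile(seq, window)
    return [i for i, h in enumerate(profile) if h >= threshold]


# Hypothetical toy peptide: charged ends around a hydrophobic core.
toy = "DDKKEE" + "LIVALLVIAFLLAVLIGLV" + "KKDDEE"
print(candidate_tm_windows(toy))
```

Windows that pass the threshold are only candidates; a real analysis would confirm them with a dedicated prediction tool.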

To study the function of corin, we performed a series of biochemical experiments. Using combined bioinformatic and biochemical approaches, we have solved a long-standing puzzle in cardiovascular biology.

4951 Projects: Characterize a protein. Carefully select a protein. Suggestions: select a protein that is in a 3D database; select a protein that has been studied in many organisms; select a protein that has a known activity; select a protein for which there are known mutant versions. Use the library to research your chosen protein. Demonstrate how its function is related to its structure. Comment on the evolution of the protein. Comment on the difference between the normal and mutant versions. Find and analyze the DNA of the protein (determine its size, its number of introns, chromosomal location, etc.). During your presentation, indicate the names of the software used and databases analyzed to give you your information.

Introduction to Molecular Phylogeny. Phylogeny: the evolutionary history of a group.

Requirement: Basic understanding of evolutionary principles. Basic understanding of mutation at the molecular level.

Genetic variation exists. Evolution depends on it. Genetic variation: DNA segments (large or small) can be altered or duplicated or deleted. Point mutations or other small changes (e.g. A → G) generate a new version of a gene (i.e. a new allele). New loci are generated by gene duplication events.

Basis of Molecular Phylogenetics. To a first approximation, the evolution of species or genes can be modeled as a bifurcating process. Two populations become reproductively isolated and diverge due to random mutational processes. Over time, this process may repeat itself, so that at any time, each population can be said to be most closely related to some other population with which it shares a direct common ancestor.

Basis of Molecular Phylogenetics. If genomes evolve by the gradual accumulation of mutations, then the amount of nucleotide sequence difference between a pair of genomes should indicate how recently those two genomes shared a common ancestor.
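The "amount of nucleotide sequence difference" can be made concrete as the proportion of differing sites (often called the p-distance). The sketch below assumes the two sequences are already aligned to equal length and that "-" marks a gap; the function name is chosen for illustration.

```python
def p_distance(seq_a, seq_b):
    """Proportion of aligned, ungapped sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Ignore any column where either sequence has a gap.
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped sites to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


print(p_distance("ACGTACGT", "ACGTACGA"))  # one difference in eight sites
```

A smaller value suggests a more recent common ancestor, subject to the caveats about multiple hits discussed later under homoplasy.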

Basis of Molecular Phylogenetics. Divergence consists of changes in characters, such as amino acids in a protein, or nucleotides in DNA. The longer two populations remain reproductively isolated, the more divergence will occur. Given the existence of homologous characters across a set of populations, it should be possible to work backwards in time, ascending the tree, until a common ancestor of all populations in the set is reached.

Word of Caution. Phylogenetic analysis is one of the most controversial areas in bioinformatics. There is a wide variety of methods for analyzing the data, and even the experts often disagree on which method is best.

Phylogenetic Data Analysis requires 4 steps (text- starting on page 327) 1) Alignment 2) Determine the substitution model 3) Tree Building 4) Tree Evaluation

Alignment. Phylogenetic analysis is very dependent on a good multiple alignment. The alignment of sequences can often have more of an impact on the final tree than the choice of phylogenetic software or phylogenetic parameters.

Homology. It is critical to phylogenetic analysis that homologous characters be compared across species. For DNA and proteins, this means that gaps must be correctly placed in multiple alignments to ensure that the same position is being compared for each species. Consequently, if a multiple alignment is poor, phylogeny construction will also be poor.

What to align? Phylogenetic trees are generated by comparing DNA, RNA, or protein. The molecule of choice depends on the question you are attempting to answer.

DNA/RNA: contains more evolutionary information than protein; the high rate of base substitution makes DNA best for very short-term studies, e.g. closely related species.

Protein: more reliable alignment than DNA (for DNA, 25% identity is expected at random); fewer homoplasies* than DNA; lower rate of substitution than DNA, so better for wide species comparisons.

* Homoplasy Return of a character to its original state, thus masking intervening mutational events. Homoplasies are most important in DNA sequences, because there are only 4 nucleotides. Every fourth mutation should result in a homoplasy.

rRNA = ribosomal RNA. Best for very long-term evolutionary studies spanning biological kingdoms. Most consistent with an evolutionary clock. Selective processes constraining sequence evolution should be roughly the same across species boundaries.

Determine the substitution model. DNA: may be a nucleotide substitution rate matrix:

        A   C   G   T
    A   -   2   1   2
    C   2   -   2   1
    G   1   2   -   2
    T   2   1   2   -

Mutation Rates Vary: Transitions (purine to purine or pyrimidine to pyrimidine) occur more frequently than transversions (purine to pyrimidine or pyrimidine to purine).
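The transition/transversion distinction can be turned into a weighted distance. The sketch below is an illustration under stated assumptions: it uses the 1-versus-2 weighting from the example rate matrix (transition = 1, transversion = 2), assumes aligned DNA sequences with no gaps, and the function names are made up here.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_cost(x, y):
    """0 for identity, 1 for a transition, 2 for a transversion."""
    if x == y:
        return 0
    pair = {x, y}
    # A transition stays within purines or within pyrimidines.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return 1
    return 2


def weighted_distance(seq_a, seq_b):
    """Sum of substitution costs over all aligned sites."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(substitution_cost(x, y) for x, y in zip(seq_a, seq_b))


# A-G transition (1) + A-T transversion (2) + match (0) + C-T transition (1)
print(weighted_distance("AACC", "GTCT"))
```

Real substitution models estimate these weights from data; the fixed costs here only show where the weighting enters the calculation.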

In general, DNA distance matrices are calculated such that each mismatch between two sequences adds to the distance, and each identity subtracts from the distance. Scoring matrices include values for all possible substitutions.

Determine the substitution model May be an amino acid substitution rate matrix such as PAM or BLOSUM.

Tree Building. There are four main tree-building methods: pairwise distance, neighbor joining, maximum parsimony, and maximum likelihood.

Basic tree terminology: Nodes: branching points Branches: lines Topology: branching pattern

Branches can be rotated at a node, without changing the relationships.

Phylogenetic trees based on pairwise distance. Simplest to visualize with DNA data: 1) Align each pair of sequences under consideration. 2) The two sequences that are closest together are connected at a node. The branch lengths reflect the degree of similarity (and theoretically reflect evolutionary time). 3) The process is repeated until all sequences are joined. 4) Addition of the last sequence defines the root of the tree.

Phylogenetic trees based on pairwise distance. Relatively simple. Problem: may not be accurate!
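The join-the-closest-pair procedure above can be sketched as follows. The slide does not say how the distance from a newly formed cluster to the remaining sequences is computed; this sketch assumes average linkage (the UPGMA rule), which is one common choice. Branch lengths are omitted, and the input format (a dict keyed by frozenset pairs) is an assumption made for brevity.

```python
from itertools import combinations


def upgma_topology(names, dist):
    """Join the closest pair repeatedly; return the tree as nested tuples.

    names: list of sequence labels
    dist:  dict mapping frozenset({a, b}) -> pairwise distance
    """
    clusters = {name: 1 for name in names}  # cluster -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Step 2: find the two clusters that are closest together.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        # Distance from the new cluster = size-weighted average (UPGMA).
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    # The final join (step 4) places the root.
    return next(iter(clusters))


example = {
    frozenset(("A", "B")): 2, frozenset(("A", "C")): 6,
    frozenset(("A", "D")): 10, frozenset(("B", "C")): 6,
    frozenset(("B", "D")): 10, frozenset(("C", "D")): 10,
}
print(upgma_topology(["A", "B", "C", "D"], example))
```

In this toy data A and B join first, C joins them next, and D is added last, which places the root between D and the rest.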

Phylogenetic trees based on neighbor joining. Also utilizes a ‘distance matrix’. The neighbor joining algorithm searches for sets of neighbors that minimize the total length of the tree. Can produce reasonable trees, especially when evolutionary distances are short.
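As a sketch of how neighbor joining chooses which pair to join, the standard criterion computes, for each pair (i, j) among n taxa, Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k), and joins the pair with the smallest Q. The code below shows only this selection step, not the full algorithm; the function name and the dict-of-dicts input are assumptions, and the distances are made-up illustration data.

```python
def nj_pick_pair(names, d):
    """Return the pair with the lowest Q value and that Q value.

    names: list of taxon labels
    d:     dict of dicts holding symmetric pairwise distances
    """
    n = len(names)
    # Total distance from each taxon to all the others.
    totals = {i: sum(d[i][k] for k in names if k != i) for i in names}
    best_pair, best_q = None, None
    for idx, i in enumerate(names):
        for j in names[idx + 1:]:
            q = (n - 2) * d[i][j] - totals[i] - totals[j]
            if best_q is None or q < best_q:
                best_pair, best_q = (i, j), q
    return best_pair, best_q


# Illustration data: d-e is the closest pair, but a-b has the lowest Q.
pairs = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
         ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
         ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
taxa = ["a", "b", "c", "d", "e"]
matrix = {t: {} for t in taxa}
for (i, j), value in pairs.items():
    matrix[i][j] = value
    matrix[j][i] = value
print(nj_pick_pair(taxa, matrix))
```

The example shows how neighbor joining differs from simply joining the closest pair: the correction terms account for how far each taxon is from everything else.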

Pairwise distance and neighbor joining are distance methods. There are two main categories of phylogeny methods, distance methods and character methods. In distance methods, the first step is to calculate a matrix of all pairwise differences between a set of sequences. Next, the tree is constructed to minimize the distance when all branches are added together.

Maximum parsimony and maximum likelihood are character methods. Character methods attempt to reconstruct ancestral nodes of trees in order to fit the tree to an evolutionary model. They therefore use more of the information in the data, at the expense of longer execution time.

Phylogenetic trees based on maximum parsimony First step in maximum parsimony analysis: Identify all of the informative sites.
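A site is usually called parsimony-informative when at least two different character states each occur in at least two sequences. A minimal sketch of that test, assuming an aligned list of equal-length strings with "-" for gaps (the function name is illustrative):

```python
from collections import Counter


def informative_sites(alignment):
    """Return the column indices that are parsimony-informative."""
    sites = []
    for pos, column in enumerate(zip(*alignment)):
        counts = Counter(c for c in column if c != "-")
        # Need at least two states that each appear two or more times.
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            sites.append(pos)
    return sites


alignment = ["AAGAT",
             "AGGCT",
             "AGACT",
             "AAACT"]
print(informative_sites(alignment))  # columns 1 and 2 qualify
```

Column 3 varies but is not informative: the single A requires one change on every possible tree, so it cannot favor one tree over another.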

Parsimony Analysis, 2nd step: Calculate the minimum number of substitutions at each informative site. (Figure: example trees requiring 1 step and 2 steps.)

Final step in parsimony analysis: After sequences are aligned, algorithms model each tree: Sum the number of changes over all informative sites for each possible tree.
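One way to count the minimum number of changes a given tree requires is Fitch's algorithm, sketched below. The slides do not name an algorithm, so treating it as Fitch is an assumption. Trees are written as nested tuples of sequence names, and all sites are scored (uninformative sites add the same amount to every tree, so the ranking is unaffected).

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, changes) for one site."""
    if isinstance(tree, str):  # a leaf: its observed state, no changes
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch_site(left, states)
    right_set, right_changes = fitch_site(right, states)
    common = left_set & right_set
    if common:  # children agree on at least one state: no new change
        return common, left_changes + right_changes
    # Children disagree: take the union and count one substitution.
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all sites.

    alignment: dict mapping sequence name -> aligned sequence
    """
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


seqs = {"W": "AAGAT", "X": "AGGCT", "Y": "AGACT", "Z": "AAACT"}
for tree in [(("W", "X"), ("Y", "Z")),
             (("W", "Y"), ("X", "Z")),
             (("W", "Z"), ("X", "Y"))]:
    print(tree, parsimony_score(tree, seqs))
```

With this toy data two of the three trees tie for the lowest score, which is a reminder that parsimony does not always single out one tree.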

Parsimony: A general scientific criterion for choosing among competing hypotheses states that we should accept the hypothesis that explains the data most simply and efficiently. The tree requiring the smallest number of nucleic acid or amino acid substitutions is selected.

Problem: As the number of sequences increases, the number of possible trees increases dramatically. (Table: number of sequences vs. number of trees; the largest entry is on the order of 10^74.)
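The growth can be checked directly. For n sequences, the number of distinct unrooted bifurcating trees is (2n - 5)!! = 1 x 3 x 5 x ... x (2n - 5), and the number of rooted trees is (2n - 3)!!. The sketch below computes both; the function names are illustrative, and which of the two counts the slide's table used is not stated in the text.

```python
def unrooted_tree_count(n):
    """Number of unrooted bifurcating trees for n >= 3 sequences."""
    if n < 3:
        raise ValueError("need at least 3 sequences")
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd numbers 3, 5, ..., 2n - 5
        count *= k
    return count


def rooted_tree_count(n):
    """Number of rooted bifurcating trees for n >= 3 sequences."""
    # Each unrooted tree has 2n - 3 branches on which a root can be placed.
    return unrooted_tree_count(n) * (2 * n - 3)


for n in (3, 4, 5, 10, 20, 50):
    print(n, unrooted_tree_count(n))
```

Ten sequences already give more than two million unrooted trees, and fifty give a number with 75 digits, which is why exhaustive evaluation quickly becomes impossible.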

Programs take shortcuts. When a large number of trees is being compared, it is impossible to score each tree. A shortcut algorithm establishes an upper bound. As it evaluates other trees, it throws out any tree exceeding the upper bound before the calculation is completed.

Phylogenetic trees based on maximum likelihood Also evaluates every possible tree topology. ML methods are probabilistic. They assign probabilities to every possible evolutionary change at informative sites.

Phylogenetic trees based on maximum likelihood The aim is to find the tree (among all possible trees) with the highest L (likelihood) value.
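To illustrate what a likelihood value is, the sketch below treats the simplest possible case: two sequences separated by a single branch of length d (expected substitutions per site), scored under the Jukes-Cantor (JC69) model. Both the choice of model and the restriction to two sequences are assumptions made for illustration; a real program evaluates whole trees, usually under richer models.

```python
import math


def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of branch length d (> 0) for two aligned sequences."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # probability a site shows the same base
    p_diff = 0.25 - 0.25 * e   # probability of one specific different base
    # The factor 0.25 is the equal base frequency of the first sequence.
    return (n_sites * math.log(0.25)
            + (n_sites - n_diff) * math.log(p_same)
            + n_diff * math.log(p_diff))


def jc69_ml_distance(n_sites, n_diff):
    """Branch length that maximizes the JC69 likelihood (closed form)."""
    p = n_diff / n_sites
    if p >= 0.75:
        raise ValueError("sequences too divergent for the JC69 correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


best = jc69_ml_distance(100, 10)
print(best, jc69_log_likelihood(best, 100, 10))
```

The value returned by `jc69_ml_distance` is the d at which the log-likelihood peaks; values on either side score lower, which is the sense in which the method picks the "highest L".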

Tree Evaluation Bootstrap method of assessing tree reliability: Inferred tree is constructed from data set. Characters are resampled from the data set with replacement. Resampling is repeated several ( ) times.

Bootstrap method. Bootstrap trees are constructed from the resampled data sets. Each bootstrap tree is compared to the original inferred tree. The percentage of bootstrap trees supporting a node is determined for each node in the tree.
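The resampling step can be sketched as follows. To keep the example short, a full tree is not rebuilt for each replicate; instead the "node" being tested is simply the closest pair of sequences, which stands in for a grouping in the inferred tree. That simplification, the default of 100 replicates, and the fixed random seed are all assumptions made for illustration.

```python
import random
from itertools import combinations


def resample_columns(alignment, rng):
    """Draw alignment columns with replacement to form a pseudo-replicate."""
    length = len(alignment[0])
    cols = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[c] for c in cols) for seq in alignment]


def closest_pair(alignment):
    """Indices of the two sequences with the fewest differences."""
    def differences(i, j):
        return sum(x != y for x, y in zip(alignment[i], alignment[j]))
    return min(combinations(range(len(alignment)), 2),
               key=lambda pair: differences(*pair))


def bootstrap_support(alignment, replicates=100, seed=0):
    """Percentage of replicates that recover the original closest pair."""
    rng = random.Random(seed)
    original = closest_pair(alignment)
    hits = sum(
        closest_pair(resample_columns(alignment, rng)) == original
        for _ in range(replicates)
    )
    return 100.0 * hits / replicates


data = ["ACGTACGTAC", "ACGTACGTAT", "ACGAACCTTC", "TCGAAGCTTC"]
print(bootstrap_support(data))
```

In a real analysis each replicate would go through the same tree-building method as the original data, and support would be tallied for every node.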

Why the controversy?? Molecular vs. Classical. Different Methods → Same Tree?? Molecular Clock.

The End