CIS786, Lecture 8 Usman Roshan Some of the slides are based upon material by Dennis Livesay and David.


CIS786, Lecture 8 Usman Roshan Some of the slides are based upon material by Dennis Livesay and David La of California State University at Pomona

Previously…

Evaluation of multiple sequence alignments
  Compare to benchmark “true” alignments
  Use simulation
  Measure conservation of an alignment
  Measure accuracy of phylogenetic trees
  How well does it align motifs?

ROSE
  Evolves sequences under an i.i.d. Markov model
  Root sequence: probabilities given by a probability vector (for proteins the default is the Dayhoff et al. values)
  Substitutions
    –Edge lengths are integers
    –Probability matrix M is given as input (default is PAM1*)
    –For an edge of length b, the probability of x → y is given by $(M^b)_{xy}$
  Insertions and deletions
    –Insertions and deletions follow the same probabilistic model
    –For each edge, the probability to insert is $i_{ins}$
    –Length of an insertion is given by a discrete probability distribution (normally exponential)
    –For an edge of length b this is repeated b times
  Model tree can be specified as input
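
The edge-evolution step described on this slide can be sketched roughly as follows. This is an illustration only, not the ROSE implementation: the function names, the geometric indel-length distribution (standing in for the "normally exponential" one), the uniform choice of inserted characters, and the separate `p_del` parameter are all assumptions made for the example.

```python
import random

def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(M, b):
    """Return M^b for a non-negative integer edge length b."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(b):
        result = mat_mul(result, M)
    return result

def indel_length(rng, stop=0.5):
    """Assumed geometric length distribution (discrete analogue of exponential)."""
    length = 1
    while rng.random() >= stop:
        length += 1
    return length

def evolve_edge(seq, M, b, alphabet, p_ins, p_del, rng):
    """Evolve seq along one edge of integer length b (hypothetical helper)."""
    Mb = mat_pow(M, b)
    index = {c: i for i, c in enumerate(alphabet)}
    # Substitutions: x -> y is drawn with probability (M^b)_{xy}.
    out = [rng.choices(alphabet, weights=Mb[index[c]])[0] for c in seq]
    # Indels: the insert/delete trial is repeated b times along the edge.
    for _ in range(b):
        if rng.random() < p_ins:
            pos = rng.randint(0, len(out))
            # Assumption: inserted characters are drawn uniformly.
            out[pos:pos] = [rng.choice(alphabet) for _ in range(indel_length(rng))]
        if out and rng.random() < p_del:
            pos = rng.randrange(len(out))
            del out[pos:pos + indel_length(rng)]
    return "".join(out)
```

For example, with a toy 4-state matrix `M` (0.97 on the diagonal, 0.01 elsewhere), `evolve_edge("ACGTACGT", M, 3, "ACGT", 0.05, 0.05, random.Random(1))` returns one simulated descendant; applying the function along every edge of a model tree, starting from the root sequence, would yield a simulated dataset.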

Phylogeny accuracy

Running time

Evaluating alignments using motif detection
  Let’s evaluate alignments by searching for motifs
  If alignment X reveals more functional motifs than Y using technique Z, then X is better than Y w.r.t. Z
  Motifs could be functional sites in proteins or functional regions in non-coding DNA

What is a “Functional Site”?
  Defining what constitutes a “functional site” is not trivial
  Residues that include and cluster around known functionality are clear candidates for functional sites
  We define a functional site as catalytic residues, binding sites, and regions that cluster around them

Functional Sites (FS)

Phylogenetic motifs
  PMs are short sequence fragments that conserve the overall familial phylogeny
  Are they functional?
  How do we detect them?

Map PMs to the Structure (figure labels: Map; Set PSZ Threshold)

PMs in Various Structures

PMs and Traditional Motifs

TIM (figure; axes: Phylogenetic Similarity, False Positive Expectation)

Cytochrome P450 (figure; axes: Phylogenetic Similarity, False Positive Expectation)

Enolase (figure; axes: Phylogenetic Similarity, False Positive Expectation)

Glycerol Kinase (figure; axes: Phylogenetic Similarity, False Positive Expectation)

Myoglobin (figure; axes: Phylogenetic Similarity, False Positive Expectation)

Evaluating alignments
  For a given alignment compute the PMs
  Determine the number of functional PMs
  Those identifying more functional PMs will be classified as better alignments
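
As a rough illustration of this ranking criterion, the sketch below counts functional PMs per alignment. The slides do not say how a PM is judged functional, so the overlap rule used here (a PM window is functional if it contains at least one known functional-site position, all in the coordinates of one reference sequence) is an assumption, as are the function names and input layout.

```python
def count_functional_pms(pm_windows, functional_sites):
    """Count PM windows (inclusive start, end) containing a functional site.

    Assumption: overlap with any known functional-site position makes a PM
    "functional"; the slides do not define the exact rule.
    """
    return sum(
        1
        for start, end in pm_windows
        if any(start <= site <= end for site in functional_sites)
    )

def rank_alignments(pms_by_alignment, functional_sites):
    """Return (best_alignment_name, counts) under the criterion above."""
    counts = {
        name: count_functional_pms(windows, functional_sites)
        for name, windows in pms_by_alignment.items()
    }
    best = max(counts, key=counts.get)
    return best, counts

# Hypothetical toy input: PM windows found on two alignments of one family.
pms = {"X": [(10, 15), (40, 45)], "Y": [(12, 14)]}
sites = {12, 44}
best, counts = rank_alignments(pms, sites)
```

Ties are broken arbitrarily by `max` in this sketch; a real comparison would probably report the counts side by side instead.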

Running time

Functional PMs
  Colors: PAl=blue, MUSCLE=red, Both=green
  Panels: (a)=enolase, (b)=ammonia channel, (c)=tri-isomerase, (d)=permease, (e)=cytochrome

Today
  More simulations…
  Comparison of MP and NJ trees on different protein alignments
  Simultaneous alignment and phylogeny reconstruction
    –Starting trees for POY
    –Boosting it with RecIDCM3

NJ vs MP on 50 taxa and 500 mean sequence length (panels: NJ, MP)

NJ vs MP on 100 taxa and 500 mean sequence length (panels: NJ, MP)

NJ vs MP on 400 taxa and 500 mean sequence length (panels: NJ, MP)

MP trees on 800 taxa alignments

Increasing sequence lengths on 50 taxa datasets

Increasing sequence lengths on 400 taxa datasets

Simultaneous alignment and phylogeny reconstruction: POY
  Performs TBR through tree space to search for better tree alignments
  Uses a variant of progressive alignment without profiles
    –Assigns ancestral sequences to internal nodes using MP
    –Removes gaps in ancestral sequences
  Optional median alignment is possible

Starting trees for POY
  Poy-default (greedy method)
  Poy-approxbuild (faster greedy method)
  Heuristic maximum parsimony trees generated on the following alignments using the TNT program (TBR search with one saved tree):
    –ClustalW (fast distance estimation)
    –Muscle1 (default): progressive alignment (BLASTZ scoring matrix)
    –Muscle2 (default): improved iterative progressive alignment (BLASTZ scoring matrix)
    –Muscle1MP: progressive alignment (scoring matrix for parsimony: match=1, mismatch=0, gapopen=gapextend=-1)
    –Muscle2MP: improved iterative progressive alignment (parsimony scoring matrix as above)
    –Muscle1MP (CW-guidetree): Muscle1-MP on the ClustalW guide-tree (fast distance estimation)
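
To make the parsimony scoring scheme (match=1, mismatch=0, gapopen=gapextend=-1) concrete, here is a minimal sketch that scores an already-aligned pair of sequences. It assumes that equal gap-open and gap-extend values amount to a flat cost of -1 per gapped column; MUSCLE's own gap accounting may differ, and the function name is hypothetical.

```python
def parsimony_pair_score(a, b, match=1, mismatch=0, gap=-1):
    """Score two aligned rows of equal length, with '-' as the gap symbol.

    Assumption: gapopen = gapextend = -1 is treated as a flat cost of -1
    for every column in which exactly one row has a gap.
    """
    if len(a) != len(b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # column gapped in both rows contributes nothing
        if x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score
```

Under this scheme maximizing the score rewards identical columns and penalizes every gapped position equally, which is why the slide calls it a scoring matrix for parsimony.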

Simulation study parameters
  Model trees: uniform random distribution and uniformly selected random edge lengths
  Model of evolution: HKY95 with insertion and deletion probabilities selected from a gamma distribution (see the ROSE software package)
  Generated data: settings of 250, 500, and 1000 taxa, mean sequence lengths of 1000 and 2000, and an avg branch length of 0.2 were selected. For each setting 1 dataset was produced.
  Criterion for branch length and sequence length selection: the evolutionary rate was selected such that the starting Poy tree had an error rate between 20% and 30%, i.e. not too hard or too easy. Mean sequence lengths of 1000 and 2000 are realistic for protein coding sequences.
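
The slides do not define the tree error rate used in the 20% to 30% criterion. One common choice, assumed here purely for illustration, is the fraction of non-trivial bipartitions of the model tree that are missing from the estimated tree. The sketch below represents trees as nested tuples of leaf names; the function names are hypothetical.

```python
def bipartitions(tree):
    """Return the non-trivial bipartitions of a tree given as nested tuples.

    Each bipartition is stored as the frozenset of leaves on the side that
    does not contain a fixed anchor leaf, so rooting does not matter.
    """
    def leaves_of(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves_of(child) for child in node))

    all_leaves = leaves_of(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - leaves if anchor in leaves else leaves
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return leaves

    visit(tree)
    return splits

def missing_branch_rate(model_tree, estimated_tree):
    """Fraction of model-tree bipartitions absent from the estimate (assumed metric)."""
    true_splits = bipartitions(model_tree)
    if not true_splits:
        return 0.0
    return len(true_splits - bipartitions(estimated_tree)) / len(true_splits)

model = ((("A", "B"), "C"), ("D", ("E", "F")))
estimate = ((("A", "C"), "B"), ("D", ("E", "F")))
rate = missing_branch_rate(model, estimate)
```

With such a function, the selection criterion on the slide would amount to keeping an evolutionary rate for which `0.2 <= rate <= 0.3` on the starting Poy tree.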

Comparison of Poy to MUSCLE and ClustalW under simulation 250 taxa, 941 mean sequence length, 0.2 avg branch length

Comparison of Poy to MUSCLE and ClustalW under simulation 500 taxa, 981 mean sequence length, 0.2 avg branch length

Comparison of Poy to MUSCLE and ClustalW under simulation 1000 taxa, 993 mean sequence length, 0.2 avg branch length

Comparison of Poy to MUSCLE and ClustalW on real data 218 taxa RNA metazoan dataset

Comparison of Poy to MUSCLE and ClustalW on real data 585 taxa RNA archaea dataset

Comparison of Poy to MUSCLE and ClustalW on real data 1040 taxa RNA mitochondria dataset

Comparison of Poy to MUSCLE and ClustalW on real data 1766 taxa RNA metazoa dataset