Multiple Sequence Alignment (cont.) (Lecture for CS397-CXZ Algorithms in Bioinformatics) Feb. 13, 2004 ChengXiang Zhai Department of Computer Science University.

Presentation transcript:

Multiple Sequence Alignment (cont.) (Lecture for CS397-CXZ Algorithms in Bioinformatics) Feb. 13, 2004 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign

Today’s topic: Approximate Algorithms for Multiple Alignment
Feng-Doolittle alignment
Improvements:
  – Profile alignment
  – Iterative refinement
ClustalW (a multiple alignment tool)

Feng-Doolittle Progressive Alignment
Step 1: Compute all possible pairwise alignments
Step 2: Convert alignment scores to distances
Step 3: Construct a “guide tree” by clustering
Step 4: Progressive alignment based on the guide tree (bottom up)
Note that variations are possible at each step!
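Step 2 can be sketched as follows. The slide does not give the conversion formula; the sketch assumes the normalization usually attributed to Feng and Doolittle, D = -log S_eff with S_eff = (S_obs - S_rand) / (S_max - S_rand). The function name, argument names, and example scores are hypothetical.

```python
import math

def feng_doolittle_distance(s_obs, s_self_a, s_self_b, s_rand):
    """Convert a pairwise alignment score into a distance (larger = less similar).

    s_obs    : observed alignment score of sequences a and b
    s_self_* : score of each sequence aligned with itself
    s_rand   : expected score for shuffled sequences (assumed to be given)
    """
    s_max = (s_self_a + s_self_b) / 2.0            # average self-alignment score
    s_eff = (s_obs - s_rand) / (s_max - s_rand)    # normalized "effective" score
    s_eff = min(max(s_eff, 1e-9), 1.0)             # clamp so the log is defined
    return -math.log(s_eff)

# Hypothetical scores: a close pair and a distant pair.
d_close = feng_doolittle_distance(s_obs=90, s_self_a=100, s_self_b=100, s_rand=10)
d_far = feng_doolittle_distance(s_obs=30, s_self_a=100, s_self_b=100, s_rand=10)
```

The clamp is a convenience of this sketch, so that scores at or below the random baseline map to a large finite distance instead of raising an error.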

Detour: Some background about clustering

How to Compute Group Similarity?
Three popular methods. Given two groups g1 and g2:
Single-link algorithm: s(g1,g2) = similarity of the closest pair
Complete-link algorithm: s(g1,g2) = similarity of the farthest pair
Average-link algorithm: s(g1,g2) = average similarity over all pairs (one member from each group)
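The three definitions can be written as one small function. This is a minimal sketch: `sim` is assumed to be a symmetric matrix of pairwise similarities, groups are lists of indices, and the example scores are made up.

```python
from itertools import product

def group_similarity(g1, g2, sim, method="average"):
    """Similarity between two groups of item indices under a linkage rule."""
    pair_sims = [sim[a][b] for a, b in product(g1, g2)]
    if method == "single":      # closest pair = most similar pair
        return max(pair_sims)
    if method == "complete":    # farthest pair = least similar pair
        return min(pair_sims)
    if method == "average":     # mean over all cross-group pairs
        return sum(pair_sims) / len(pair_sims)
    raise ValueError(f"unknown method: {method}")

# Hypothetical similarity scores for four items (diagonal unused).
sim = [
    [0, 9, 2, 4],
    [9, 0, 6, 8],
    [2, 6, 0, 7],
    [4, 8, 7, 0],
]
g1, g2 = [0, 1], [2, 3]
single = group_similarity(g1, g2, sim, "single")      # 8
complete = group_similarity(g1, g2, sim, "complete")  # 2
average = group_similarity(g1, g2, sim, "average")    # 5.0
```

Because the scores are similarities and not distances, "closest" maps to the maximum and "farthest" to the minimum.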

Three Methods Illustrated
[Figure: two groups g1 and g2, showing which pairs the single-link, complete-link, and average-link algorithms compare]

Comparison of the Three Methods
Single-link
  – “Loose” clusters
  – Individual decision, sensitive to outliers
Complete-link
  – “Tight” clusters
  – Individual decision, sensitive to outliers
Average-link
  – “In between”
  – Group decision, insensitive to outliers
Which one is the best? Depends on what you need!

Feng-Doolittle: Clustering Example
[Figure: similarity matrix (from pairwise alignment) for sequences X1 to X5, with length normalization, and the clusters merged step by step]
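A guide tree for five sequences can be built with a short agglomerative loop. The sketch below uses average-link clustering; the similarity values are hypothetical and are not the ones from the slide's matrix.

```python
def build_guide_tree(names, sim):
    """Average-link agglomerative clustering; returns a nested-tuple tree."""
    # Each cluster is (subtree, list of member indices).
    clusters = [(name, [i]) for i, name in enumerate(names)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [sim[a][b] for a in clusters[i][1] for b in clusters[j][1]]
                score = sum(pairs) / len(pairs)
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        # Merge the most similar pair of clusters into one node.
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

names = ["X1", "X2", "X3", "X4", "X5"]
# Hypothetical length-normalized similarity scores (diagonal unused).
sim = [
    [0, 9, 3, 2, 1],
    [9, 0, 4, 2, 1],
    [3, 4, 0, 8, 2],
    [2, 2, 8, 0, 3],
    [1, 1, 2, 3, 0],
]
tree = build_guide_tree(names, sim)
```

With these values X1 and X2 merge first, then X3 and X4, then the two pairs, and X5 joins last. Progressive alignment would follow the tree bottom up in the same order.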

Feng-Doolittle: How to generate a multiple alignment?
At each step, consider all possible pairwise alignments and pick the best one (3 cases):
  – Sequence vs. sequence
  – Sequence vs. group
  – Group vs. group
“Once a gap, always a gap”
  – A gap is replaced by a neutral symbol X
  – X can be matched with any symbol, including a gap, without penalty
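The "once a gap, always a gap" rule amounts to two small operations: freeze the gaps of a finished subalignment, and score the neutral symbol as free. The sketch below is illustrative only; the match, mismatch, and gap values are hypothetical, and X is used as the neutral symbol exactly as on the slide.

```python
def freeze_gaps(alignment, gap="-", neutral="X"):
    """Replace every gap in an aligned group with the neutral symbol."""
    return [row.replace(gap, neutral) for row in alignment]

def match_score(a, b, match=1, mismatch=-1, gap_penalty=-2, gap="-", neutral="X"):
    """Score one aligned pair of symbols; the neutral symbol never costs anything."""
    if a == neutral or b == neutral:
        return 0                      # X matches any symbol, including a gap
    if a == gap or b == gap:
        return gap_penalty
    return match if a == b else mismatch

group = ["AC-GT", "ACTGT"]
frozen = freeze_gaps(group)           # ['ACXGT', 'ACTGT']
```

Once frozen, later alignment steps can add new gaps around the group but cannot remove the X positions, which is what keeps earlier decisions fixed.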

Problems with Feng-Doolittle
All alignments are completely determined by pairwise alignment (restricted search space)
No backtracking (a subalignment is “frozen”)
  – No way to correct an early mistake
  – Non-optimality: mismatches and gaps in a highly conserved region should be penalized more, but we can’t tell where the highly conserved regions are early in the process
Improvements:
  – Profile alignment
  – Iterative refinement

Profile Alignment
Aligning two alignments/profiles
Treat each alignment as “frozen”
Align them with a possible “column gap”
The score within each alignment is fixed for any two given alignments
Only the score between the two alignments needs to be optimized
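A minimal sketch of profile alignment: global dynamic programming over whole columns, where only the cross-alignment pair scores are optimized and a "column gap" inserts a gap into every row of one alignment. The scoring values (match 1, mismatch -1, gap -2) and function names are assumptions for illustration.

```python
def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def column_score(col_a, col_b):
    """Sum of pair scores between a column of A and a column of B."""
    return sum(pair_score(a, b) for a in col_a for b in col_b)

def align_profiles(aln_a, aln_b):
    cols_a = ["".join(c) for c in zip(*aln_a)]
    cols_b = ["".join(c) for c in zip(*aln_b)]
    gap_a, gap_b = "-" * len(aln_a), "-" * len(aln_b)   # column gaps
    n, m = len(cols_a), len(cols_b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + column_score(cols_a[i - 1], gap_b)
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + column_score(gap_a, cols_b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1]),
                score[i - 1][j] + column_score(cols_a[i - 1], gap_b),
                score[i][j - 1] + column_score(gap_a, cols_b[j - 1]),
            )
    # Trace back, emitting merged columns from the end to the start.
    out, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1]):
            out.append(cols_a[i - 1] + cols_b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + column_score(cols_a[i - 1], gap_b):
            out.append(cols_a[i - 1] + gap_b); i -= 1
        else:
            out.append(gap_a + cols_b[j - 1]); j -= 1
    out.reverse()
    return ["".join(row) for row in zip(*out)], score[n][m]

merged, total = align_profiles(["ACGT", "ACGT"], ["AGT"])
```

The rows of each input alignment keep their relative layout; only whole columns are shifted, which is what "frozen" means here.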

Iterative Refinement
Re-assign a sequence to a different cluster/profile
Repeat this a fixed number of times or until the score converges
In effect, this enlarges the search space

ClustalW: A Multiple Alignment Tool
Essentially follows Feng-Doolittle
  – Do pairwise alignment (dynamic programming)
  – Do score conversion/normalization (Kimura’s model)
  – Construct a guide tree (neighbor-joining clustering)
  – Progressively align all sequences using profile alignment
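The slide names Kimura's model without giving a formula. A commonly cited form of Kimura's protein distance correction is D = -ln(1 - p - 0.2 p^2), where p is the fraction of aligned positions that differ. The sketch below assumes that form; whether ClustalW applies exactly this expression in every case is not stated in these slides.

```python
import math

def kimura_distance(seq_a, seq_b):
    """Kimura-corrected distance for two aligned sequences of equal length."""
    # Ignore columns where either sequence has a gap.
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped aligned positions")
    p = sum(a != b for a, b in pairs) / len(pairs)   # observed fraction differing
    inner = 1.0 - p - 0.2 * p * p
    if inner <= 0:
        raise ValueError("sequences too divergent for this correction")
    return -math.log(inner)

d = kimura_distance("peeksavtal", "peeksavlal")   # p = 0.1
```

The corrected distance is slightly larger than p itself, reflecting multiple substitutions at the same site that a raw difference count cannot see.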

ClustalW Heuristics
Avoid penalizing minority sequences
  – Sequence weighting
  – Consider “evolution time” (using different substitution matrices)
More reasonable gap penalties, e.g.,
  – Depend on the actual residues at or around the position (e.g., hydrophobic residues give a higher gap penalty)
  – Increase the gap penalty near a well-conserved region (e.g., a perfectly aligned column)
Postpone low-scoring alignments until more profile information is available

Heuristic 1: Sequence Weighting
Motivation: address sample bias
Idea:
  – Down-weight sequences that are very similar to other sequences
  – Each sequence gets a weight
  – Scoring is based on the weights
Example (sequence weighting):
  w1: peeksavtal
  w2: peeksavlal
  w3: egewglvlhv
  w4: aaektkirsa
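The idea of down-weighting near-duplicates can be illustrated on the four example sequences. The scheme below (weight inversely proportional to summed identity with all sequences) is a deliberately simple stand-in chosen for this sketch; it is not the weighting scheme ClustalW itself uses, and the sequences are assumed to be aligned without gaps.

```python
def percent_identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def sequence_weights(seqs):
    """Weights that shrink as a sequence becomes more similar to the others."""
    raw = []
    for s in seqs:
        total = sum(percent_identity(s, t) for t in seqs)  # includes self (1.0)
        raw.append(1.0 / total)
    norm = sum(raw)
    return [w / norm for w in raw]                         # weights sum to 1

seqs = ["peeksavtal", "peeksavlal", "egewglvlhv", "aaektkirsa"]
w1, w2, w3, w4 = sequence_weights(seqs)
```

The first two sequences differ at a single position, so they end up with the lowest weights, while the two divergent sequences get more influence on the profile scores.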

Heuristic 2: Sophisticated Gap Weighting
Initially,
  – GOP: “gap open penalty”
  – GEP: “gap extension penalty”
Adjusted gap penalty
  – Dependence on the weight matrix
  – Dependence on the similarity of sequences
  – Dependence on lengths of the sequences
  – Dependence on the difference in the lengths of the sequences
  – Position-specific gap penalties
  – Lowered gap penalties at existing gaps
  – Increased gap penalties near existing gaps
  – Reduced gap penalties in hydrophilic stretches
  – Residue-specific penalties

Gap Adjustment Heuristics
Weight matrix
  – Gap penalties should be comparable with weights
Similarity of sequences
  – GOP should be larger for closely related sequences
Sequence length
  – Long sequences tend to have higher scores
Difference in sequence lengths
  – Avoid too many gaps in the short sequence
GOP = {GOP + log[min(N,M)]} * (avg residue mismatch score) * (percent identity scaling factor)
GEP = GEP * [1.0 + |log(N/M)|]
N, M = sequence lengths, N > M
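The two formulas translate directly into code. In this sketch the logarithm is assumed to be the natural log (the slide does not say which base is meant), and the mismatch score and identity scaling factor are passed in as plain numbers because the slide does not define how they are computed.

```python
import math

def initial_gap_penalties(gop, gep, n, m, avg_mismatch_score, identity_scale):
    """Adjust the user-supplied GOP and GEP for one pair of sequences/profiles.

    n, m               : lengths of the two sequences
    avg_mismatch_score : average residue mismatch score of the weight matrix
    identity_scale     : percent-identity scaling factor (assumed given)
    """
    adj_gop = (gop + math.log(min(n, m))) * avg_mismatch_score * identity_scale
    adj_gep = gep * (1.0 + abs(math.log(n / m)))
    return adj_gop, adj_gep

# Hypothetical inputs: equal lengths leave GEP unchanged, unequal lengths raise it.
gop_eq, gep_eq = initial_gap_penalties(10.0, 0.2, 100, 100, 1.0, 1.0)
gop_diff, gep_diff = initial_gap_penalties(10.0, 0.2, 200, 100, 1.0, 1.0)
```

Raising GEP when the lengths differ discourages scattering many gaps through the shorter sequence, which is the stated purpose of the length-difference term.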

Gap Adjustment Heuristics (cont.)
Position-specific gap penalties
  – Lowered gap penalties at existing gaps
      GOP = GOP * 0.3 * (no. of sequences without a gap / no. of sequences)
  – Increased gap penalties near existing gaps
      GOP = GOP * {2 + [(8 - distance from gap) * 2] / 8}
  – Reduced gap penalties in hydrophilic stretches (5 AAs): if no gaps, and one sequence has a hydrophilic stretch
      GOP = GOP * 1/3
  – Residue-specific penalties (specified in a table): if no gaps and no hydrophilic stretch
      GOP = GOP * avgFactor (averaged over all the residues at the position)
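The four position-specific rules can be combined into one function that returns the GOP for a single column. The parameter names are hypothetical, and the order in which the rules are checked (existing gap, then nearby gap, then hydrophilic stretch, then residue table) is an assumption of this sketch; the slide only fixes the order of the last two.

```python
def position_gop(gop, n_seqs, n_without_gap, dist_to_gap=None,
                 hydrophilic_stretch=False, avg_residue_factor=1.0, window=8):
    """Gap open penalty at one alignment column, following the slide's rules."""
    if n_without_gap < n_seqs:
        # Column already contains gaps: lower the penalty.
        return gop * 0.3 * (n_without_gap / n_seqs)
    if dist_to_gap is not None and dist_to_gap < window:
        # Within `window` columns of an existing gap: raise the penalty.
        return gop * (2 + ((window - dist_to_gap) * 2) / window)
    if hydrophilic_stretch:
        # No gaps, and some sequence has a hydrophilic stretch here.
        return gop * (1 / 3)
    # No gaps and no hydrophilic stretch: residue-specific factor,
    # averaged over the residues in the column (table lookup assumed done).
    return gop * avg_residue_factor

at_gap = position_gop(10.0, n_seqs=4, n_without_gap=2)
near_gap = position_gop(10.0, n_seqs=4, n_without_gap=4, dist_to_gap=4)
hydrophilic = position_gop(10.0, n_seqs=4, n_without_gap=4, hydrophilic_stretch=True)
plain = position_gop(10.0, n_seqs=4, n_without_gap=4, avg_residue_factor=1.2)
```

The net effect is that new gaps are steered into columns that already have gaps or into likely loop regions, and away from the columns right next to an existing gap.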

Heuristic 3: Delayed Alignment of Divergent Sequences
Divergence measure: average percentage of identity with any other sequence
Apply a threshold (e.g., 40% identity) to detect divergent sequences (“outliers”)
Postpone the alignment of divergent sequences until all of the rest have been aligned
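Detecting the sequences to postpone is a simple thresholding step. The sketch assumes a precomputed symmetric matrix of pairwise fractional identities; the names and values are hypothetical, and the 0.40 default follows the 40% example on the slide.

```python
def split_divergent(names, identity, threshold=0.40):
    """Split sequences into (core, delayed) by average identity to the others."""
    core, delayed = [], []
    n = len(names)
    for i, name in enumerate(names):
        others = [identity[i][j] for j in range(n) if j != i]
        avg = sum(others) / len(others)
        (delayed if avg < threshold else core).append(name)
    return core, delayed

names = ["A", "B", "C", "D"]
# Hypothetical pairwise fractional identities (diagonal unused).
identity = [
    [1.00, 0.90, 0.80, 0.20],
    [0.90, 1.00, 0.85, 0.25],
    [0.80, 0.85, 1.00, 0.30],
    [0.20, 0.25, 0.30, 1.00],
]
core, delayed = split_divergent(names, identity)
align_order = core + delayed   # divergent sequences are aligned last
```

Sequence D averages 25% identity with the others, so it is held back until A, B, and C have been aligned and a fuller profile is available.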