Multiple Sequence Alignments. The Global Alignment problem AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC x y z.

Slides:

Advertisements

Similar presentations

Lecture 24 Coping with NPC and Unsolvable problems. When a problem is unsolvable, that's generally very bad news: it means there is no general algorithm.

Advertisements

Multiple Sequence Alignment (MSA) I519 Introduction to Bioinformatics, Fall 2012.

Multiple Sequence Alignment

CS262 Lecture 9, Win07, Batzoglou History of WGA 1982: -virus, 48,502 bp 1995: h-influenzae, 1 Mbp 2000: fly, 100 Mbp 2001 – present  human (3Gbp), mouse.

BNFO 602 Multiple sequence alignment Usman Roshan.

Sequence Similarity. The Viterbi algorithm for alignment Compute the following matrices (DP)  M(i, j):most likely alignment of x 1 …x i with y 1 …y j.

Genomic Sequence Alignment. Overview Dynamic programming & the Needleman-Wunsch algorithm Local alignment—BLAST Fast global alignment Multiple sequence.

Methods to CHAIN Local Alignments Sparse Dynamic Programming O(N log N)

CS262 Lecture 9, Win07, Batzoglou Fragment Assembly Given N reads… Where N ~ 30 million… We need to use a linear-time algorithm.

Lecture 8: Multiple Sequence Alignment

Introduction to Bioinformatics Algorithms Multiple Alignment.

1 Protein Multiple Alignment by Konstantin Davydov.

CS262 Lecture 14, Win07, Batzoglou Multiple Sequence Alignments.

Multiple Sequence Alignment Algorithms in Computational Biology Spring 2006 Most of the slides were created by Dan Geiger and Ydo Wexler and edited by.

Rapid Global Alignments How to align genomic sequences in (more or less) linear time.

Multiple Sequences Alignment Ka-Lok Ng Dept. of Bioinformatics Asia University.

Expected accuracy sequence alignment

CS262 Lecture 9, Win07, Batzoglou Multiple Sequence Alignments.

Introduction to Bioinformatics Algorithms Multiple Alignment.

Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.

. Multiple Sequence Alignment Tutorial #4 © Ilan Gronau.

CS262 Lecture 9, Win07, Batzoglou Phylogeny Tree Reconstruction

Multiple Alignment. Outline Problem definition Can we use Dynamic Programming to solve MSA? Progressive Alignment ClustalW Scoring Multiple Alignments.

Multiple Sequence Alignments. Lecture 12, Tuesday May 13, 2003 Reading Durbin’s book: Chapter Gusfield’s book: Chapter 14.1, 14.2, 14.5,

Aligning Alignments Soni Mukherjee 11/11/04. Pairwise Alignment Given two sequences, find their optimal alignment Score = (#matches) * m - (#mismatches)

Rapid Global Alignments How to align genomic sequences in (more or less) linear time.

CS262 Lecture 14, Win06, Batzoglou Multiple Sequence Alignments.

CS262 Lecture 9, Win07, Batzoglou Real-world protein aligners MUSCLE  High throughput  One of the best in accuracy ProbCons  High accuracy  Reasonable.

. Multiple Sequence Alignment Tutorial #4 © Ilan Gronau.

Multiple Sequence Alignments Algorithms. MLAGAN: progressive alignment of DNA Given N sequences, phylogenetic tree Align pairwise, in order of the tree.

Multiple alignment: heuristics

Multiple sequence alignment

CS262 Lecture 12, Win07, Batzoglou Some new sequencing technologies.

BNFO 602 Multiple sequence alignment Usman Roshan.

Protein Multiple Sequence Alignment Sarah Aerni CS374 December 7, 2006.

Multiple sequence alignment methods 1 Corné Hoogendoorn Denis Miretskiy.

Multiple Sequence Alignment

CECS Introduction to Bioinformatics University of Louisville Spring 2003 Dr. Eric Rouchka Lecture 3: Multiple Sequence Alignment Eric C. Rouchka,

. Multiple Sequence Alignment Tutorial #4 © Ilan Gronau.

Introduction to Bioinformatics Algorithms Multiple Alignment.

CISC667, F05, Lec8, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Multiple Sequence Alignment Scoring Dynamic Programming algorithms Heuristic algorithms.

Scoring a multiple alignment Sum of pairsStarTree A A C CA A A A A A A CC CC.

Introduction to Bioinformatics Algorithms Multiple Alignment.

Chapter 5 Multiple Sequence Alignment.

Multiple Sequence Alignment S 1 = AGGTC S 2 = GTTCG S 3 = TGAAC Possible alignment A-TA-T GGGGGG G--G-- TTATTA -TA-TA CCCCCC -G--G- AG-AG- GTTGTT GTGGTG.

Multiple Sequence Alignment CSC391/691 Bioinformatics Spring 2004 Fetrow/Burg/Miller (Slides by J. Burg)

Multiple Sequence Alignment

Multiple Sequence Alignments

Multiple Alignment Modified from Tolga Can’s lecture notes (METU)

Multiple Sequence Alignment. Definition Given N sequences x 1, x 2,…, x N :  Insert gaps (-) in each sequence x i, such that All sequences have the.

1 Generalized Tree Alignment: The Deferred Path Heuristic Stinus Lindgreen

Multiple Sequence Alignments Craig A. Struble, Ph.D. Department of Mathematics, Statistics, and Computer Science Marquette University.

Using Traveling Salesman Problem Algorithms to Determine Multiple Sequence Alignment Orders Weiwei Zhong.

Bioinformatics Multiple Alignment. Overview Introduction Multiple Alignments Global multiple alignment –Introduction –Scoring –Algorithms.

Multiple alignment: Feng- Doolittle algorithm. Why multiple alignments? Alignment of more than two sequences Usually gives better information about conserved.

Multiple sequence comparison (MSC) Reading: Setubal/Meidanis, 3.4 Gusfield, Algorithms on Strings, Trees and Sequences, chapter 14.

Expected accuracy sequence alignment Usman Roshan.

COT 6930 HPC and Bioinformatics Multiple Sequence Alignment Xingquan Zhu Dept. of Computer Science and Engineering.

EVOLUTIONARY HMMS BAYESIAN APPROACH TO MULTIPLE ALIGNMENT Siva Theja Maguluri CS 598 SS.

Introduction to Bioinformatics Algorithms Multiple Alignment Lecture 20.

Multiple Sequence Alignment

Multiple Sequence Alignment

Multiple Sequence Alignment (cont.) (Lecture for CS397-CXZ Algorithms in Bioinformatics) Feb. 13, 2004 ChengXiang Zhai Department of Computer Science University.

CS 5263 Bioinformatics Lecture 7: Heuristic Sequence Alignment Tools (BLAST) Multiple Sequence Alignment.

More on HMMs and Multiple Sequence Alignment BMI/CS 776 Mark Craven March 2002.

Introduction to Bioinformatics Algorithms Multiple Alignment.

Multiple sequence alignment (msa)

Multiple Sequence Alignment

Multiple Sequence Alignment

Presentation transcript:

Multiple Sequence Alignments

The Global Alignment problem AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC x y z

Definition Given N sequences x 1, x 2,…, x N :  Insert gaps (-) in each sequence x i, such that All sequences have the same length L Score of the global map is maximum A faint similarity between two sequences becomes significant if present in many Multiple alignments can help improve the pairwise alignments

Scoring Function: Sum Of Pairs Definition: Induced pairwise alignment A pairwise alignment induced by the multiple alignment Example: x:AC-GCGG-C y:AC-GC-GAG z:GCCGC-GAG Induces: x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAG y: ACGC-GAC; z: GCCGC-GAG; z: GCCGCGAG

Sum Of Pairs (cont’d) Heuristic way to incorporate evolution tree: Human Mouse Chicken Weighted SOP: S(m) =  k<l w kl s(m k, m l ) w kl : weight decreasing with distance Duck

A Profile Representation Given a multiple alignment M = m 1 …m n  Replace each column m i with profile entry p i Frequency of each letter in  # gaps Optional: # gap openings, extensions, closings - A G G C T A T C A C C T G T A G – C T A C C A G C A G – C T A C C A G C A G – C T A T C A C – G G C A G – C T A T C G C – G G A C G T O E.4 C

Multiple Sequence Alignments Algorithms

1. Multidimensional Dynamic Programming Generalization of Needleman-Wunsh: S(m) =  i S(m i ) (sum of column scores) F(i 1,i 2,…,i N ): Optimal alignment up to (i 1, …, i N ) F(i 1,i 2,…,i N )= max (all neighbors of cube) (F(nbr)+S(nbr))

Example: in 3D (three sequences): 7 neighbors/cell F(i,j,k) = max{ F(i-1,j-1,k-1)+S(x i, x j, x k ), F(i-1,j-1,k )+S(x i, x j, - ), F(i-1,j,k-1)+S(x i, -, x k ), F(i-1,j,k )+S(x i, -, - ), F(i,j-1,k-1)+S( -, x j, x k ), F(i,j-1,k )+S( -, x j, x k ), F(i,j,k-1)+S( -, -, x k ) } 1. Multidimensional Dynamic Programming

Running Time: 1.Size of matrix:L N ; Where L = length of each sequence N = number of sequences 2.Neighbors/cell: 2 N – 1 Therefore………………………… O(2 N L N ) 1. Multidimensional Dynamic Programming How do affine gaps generalize? VERY badly!  Require 2 N states, one per combination of gapped/ungapped sequences  Running time: O(2 N  2 N  L N ) = O(4 N L N ) XYXYZZ YYZ XXZ

2. Progressive Alignment When evolutionary tree is known:  Align closest first, in the order of the tree  In each step, align two sequences x, y, or profiles p x, p y, to generate a new alignment with associated profile p result Weighted version:  Tree edges have weights, proportional to the divergence in that edge  New profile is a weighted average of two old profiles x w y z Example Profile: (A, C, G, T, -) p x = (0.8, 0.2, 0, 0, 0) p y = (0.6, 0, 0, 0, 0.4) s(p x, p y ) = 0.8*0.6*s(A, A) + 0.2*0.6*s(C, A) + 0.8*0.4*s(A, -) + 0.2*0.4*s(C, -) Result: p xy = (0.7, 0.1, 0, 0, 0.2) s(p x, -) = 0.8*1.0*s(A, -) + 0.2*1.0*s(C, -) Result: p x- = (0.4, 0.1, 0, 0, 0.5)

2. Progressive Alignment When evolutionary tree is unknown:  Perform all pairwise alignments  Define distance matrix D, where D(x, y) is a measure of evolutionary distance, based on pairwise alignment  Construct a tree (we will describe more in detail later in the course)  Align on the tree x w y z ?

Aligning two alignments Given two alignments, m 1, m 2, can we find the optimal alignment under SOP scoring, with affine gaps? GTAGTCAGTCG x m 1 ---GTCACGTG y GTCGTCAGTCG z m 2 --CGCCAGGGG w --CGCCAGGGA v m 1 x GGGCACTGCAT y GGTTACGTC-- m 2 z GGGAACTGCAG w GGACGTACC-- v GGACCT-----

Aligning two alignments Given two alignments, m 1, m 2, can we find the optimal alignment under SOP scoring, with affine gaps? NP-hard! GTAGTCAGTCG x m 1 ---GTCACGTG y GTCGTCAGTCG z m 2 --CGCCAGGGG w --CGCCAGGGA v m 1 x GGGCACTGCAT y GGTTACGTC-- m 2 z GGGAACTGCAG w GGACGTACC-- v GGACCT----- Optimistic: assume no gap – don’t pay gap-open penalty Pessimistic: assume gap – pay gap-open penalty

Heuristics to improve multiple alignments Iterative refinement schemes A*-based search Consistency Simulated Annealing …

Iterative Refinement One problem of progressive alignment: Initial alignments are “frozen” even when new evidence comes Example: x:GAAGTT y:GAC-TT z:GAACTG w:GTACTG Frozen! Now clear correct y = GA-CTT

Iterative Refinement Algorithm (Barton-Stenberg): 1.Align most similar x i, x j 2.Align x k most similar to (x i x j ) 3.Repeat 2 until (x 1 …x N ) are aligned 4.For j = 1 to N, Remove x j, and realign to x 1 …x j-1 x j+1 …x N 5.Repeat 4 until convergence Note: Guaranteed to converge

Iterative Refinement For each sequence y 1.Remove y 2.Realign y (while rest fixed) x y z x,z fixed projection allow y to vary

Iterative Refinement Example: align (x,y), (z,w), (xy, zw): x:GAAGTTA y:GAC-TTA z:GAACTGA w:GTACTGA After realigning y: x:GAAGTTA y:G-ACTTA + 3 matches z:GAACTGA w:GTACTGA

Iterative Refinement Example not handled well: x:GAAGTTA y 1 :GAC-TTA y 2 :GAC-TTA y 3 :GAC-TTA z:GAACTGA w:GTACTGA Realigning any single y i changes nothing

A* for Multiple Alignments Review of the A* algorithm v START GOAL Say that we have a gigantic graph G START: start node GOAL: we want to reach this node with the minimum path Dijkstra: O(VlogV + E) – too slow if the number of edges is huge A*: a way of finding the optimal solution faster in practice

A* for Multiple Alignments Review of the A* algorithm v START GOAL g(v) h(v) g(v) is the cost so far h(v) is an estimate of the minimum cost from v to GOAL f(v) ≥ g(v) + h(v) is the minimum cost of a path passing by v 1. Expand v with the smallest f(v) 2. Never expand v, if f(v) ≥ shortest path to the goal found so far Lemma Given sequences x, y, z, … The sum-of pairs score of multiple alignment M is lower (worse) than the sum of the optimal pairwise alignments Proof M induces projected pairwise alignments a xy, a yz, a xz, …, and Score(M) = d(a xy ) + d(a xz ) + d(a yz ) +… Each of d(.) is smaller than the optimal edit distance

A* for Multiple Alignments Nodes: Cells in the DP matrix g(v): alignment cost so far h(v): sum-of-pairs of individual pairwise alignments Initial minimum alignment cost estimate: sum-of-pairs of global pairwise alignments v START GOAL g(v) h(v) To compute h(v) For each pair of sequences x, y, Compute F R (x, y), the DP matrix of scores of aligning a suffix of x to a suffix of y Then, at position (i 1, i 2, …, i N ), h(v) becomes the sum of (N choose 2) F R scores

Consistency z x y xixi yjyj y j’ zkzk

Consistency Basic method for applying consistency Compute all pairs of alignments xy, xz, yz, … When aligning x, y during progressive alignment,  For each (x i, y j ), let s(x i, y j ) = function_of(x i, y j, a xz, a yz )  Align x and y with DP using the modified s(.,.) function z x y xixi yjyj y j’ zkzk

Some Resources Genome Resources Annotation and alignment genome browser at UCSC Specialized VISTA alignment browser at LBNL ABC—Nice Stanford tool for browsing alignments Protein Multiple Aligners CLUSTALW – most widely used MUSCLE – most scalable PROBCONS – most accurate

Whole-genome alignment Rat—Mouse—Human

Next 2 years: 20+ mammals, & many other animals, will be sequenced & aligned