Aligning Alignments Soni Mukherjee 11/11/04

Pairwise Alignment
Given two sequences, find their optimal alignment.
Score = (#matches) * m - (#mismatches) * s - (#gaps) * d
The optimal alignment is the alignment with the maximum score.
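
As a small illustration of the scoring formula, the sketch below scores one fixed alignment given as two equal-length gapped strings. The function name `alignment_score`, the `-` gap character, and the default values of m, s, and d are assumptions made for this example, not values from the slides.

```python
def alignment_score(a, b, m=1, s=1, d=2):
    """Score two gapped rows: +m per match, -s per mismatch, -d per gap."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for p, q in zip(a, b):
        if p == "-" and q == "-":
            continue          # a column of two gaps carries no information
        if p == "-" or q == "-":
            score -= d        # one letter against a gap
        elif p == q:
            score += m        # match
        else:
            score -= s        # mismatch
    return score


print(alignment_score("AC-GT", "ACTGA"))
```

Here the alignment has three matches, one mismatch, and one gap, so with the default parameters the score is 3 - 1 - 2.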

Dynamic Programming
We want to align $x_1 \dots x_m$ and $y_1 \dots y_n$.
$D(i, j)$ = optimal score of aligning $x_1 \dots x_i$ and $y_1 \dots y_j$.
The solution is $D(m, n)$.

Dynamic Programming
Three possible cases for computing $D(i, j)$:
1. $x_i$ aligns to $y_j$ (rows $x_1 \dots x_{i-1}\ x_i$ over $y_1 \dots y_{j-1}\ y_j$): $D(i, j) = D(i-1, j-1) + m$ if $x_i = y_j$, and $D(i-1, j-1) - s$ otherwise
2. $x_i$ aligns to a gap (rows $x_1 \dots x_{i-1}\ x_i$ over $y_1 \dots y_j\ -$): $D(i, j) = D(i-1, j) - d$
3. $y_j$ aligns to a gap (rows $x_1 \dots x_i\ -$ over $y_1 \dots y_{j-1}\ y_j$): $D(i, j) = D(i, j-1) - d$
Example alignment:
C--GCCTAG-CT--AG
CT-GC-TAT-CTTTAG

Dynamic Programming
Inductive assumption:
– $D(i-1, j-1)$, $D(i-1, j)$ and $D(i, j-1)$ are optimal
$$D(i, j) = \max \begin{cases} D(i-1, j-1) + s(x_i, y_j) \\ D(i-1, j) - d \\ D(i, j-1) - d \end{cases}$$
where $s(x_i, y_j) = m$ if $x_i = y_j$, and $-s$ otherwise.
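
The recurrence can be sketched as a table fill. This is a minimal sketch, not the presenter's code: the function name `needleman_wunsch` and the default parameter values are assumptions, and the boundary values $D(i, 0) = -i \cdot d$ and $D(0, j) = -j \cdot d$ are the usual choice for global alignment rather than something stated on the slide.

```python
def needleman_wunsch(x, y, m=1, s=1, d=2):
    """Fill the table D for a linear gap penalty and return it."""
    rows, cols = len(x) + 1, len(y) + 1
    D = [[0] * cols for _ in range(rows)]
    # Assumed boundary: a prefix aligned against nothing is all gaps.
    for i in range(1, rows):
        D[i][0] = -i * d
    for j in range(1, cols):
        D[0][j] = -j * d
    for i in range(1, rows):
        for j in range(1, cols):
            sub = m if x[i - 1] == y[j - 1] else -s
            D[i][j] = max(
                D[i - 1][j - 1] + sub,  # x_i aligns to y_j
                D[i - 1][j] - d,        # x_i aligns to a gap
                D[i][j - 1] - d,        # y_j aligns to a gap
            )
    return D


table = needleman_wunsch("AC", "A")
print(table[-1][-1])  # optimal score D(m, n)
```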

Dynamic Programming
Matrix D
[Figure: cell $D(i, j)$ is reached from $D(i-1, j-1)$ with score $+s(x_i, y_j)$, and from $D(i-1, j)$ and $D(i, j-1)$ with score $-d$.]

Needleman-Wunsch
Every non-decreasing path from (0,0) to (M,N) corresponds to an alignment of the two sequences.
[Figure: the DP matrix, with $y_1 \dots y_N$ along one axis and $x_1 \dots x_M$ along the other.]
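
To make the path-to-alignment correspondence concrete, the sketch below fills the table and then walks one optimal path back from (M,N) to (0,0), emitting a column per step. The name `nw_align`, the parameter defaults, the boundary values, and the tie-breaking order (diagonal, then up, then left) are assumptions for illustration.

```python
def nw_align(x, y, m=1, s=1, d=2):
    """Return (score, aligned_x, aligned_y) for a linear gap penalty."""
    nx, ny = len(x), len(y)
    D = [[0] * (ny + 1) for _ in range(nx + 1)]
    for i in range(1, nx + 1):
        D[i][0] = -i * d
    for j in range(1, ny + 1):
        D[0][j] = -j * d
    for i in range(1, nx + 1):
        for j in range(1, ny + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            D[i][j] = max(D[i - 1][j - 1] + sub, D[i - 1][j] - d, D[i][j - 1] - d)

    # Trace one optimal path backwards; each step yields one column.
    ax, ay = [], []
    i, j = nx, ny
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = m if x[i - 1] == y[j - 1] else -s
            if D[i][j] == D[i - 1][j - 1] + sub:   # diagonal step
                ax.append(x[i - 1])
                ay.append(y[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and D[i][j] == D[i - 1][j] - d:   # x_i against a gap
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:                                      # y_j against a gap
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
    return D[nx][ny], "".join(reversed(ax)), "".join(reversed(ay))


score, top, bottom = nw_align("GCATGCT", "GATTACA", d=1)
print(score)
print(top)
print(bottom)
```

Different tie-breaking orders can return different alignments with the same optimal score.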

Scoring Gaps More Accurately
Linear gap model: a gap of length $n$ incurs penalty $p(n) = n \cdot d$
Convex gap model: for all $n$, $p(n+1) - p(n) < p(n) - p(n-1)$
$$D(i, j) = \max \begin{cases} D(i-1, j-1) + s(x_i, y_j) \\ \max_{k = 0 \dots i-1} D(k, j) - p(i-k) \\ \max_{k = 0 \dots j-1} D(i, k) - p(j-k) \end{cases}$$
Running time = $O(N^3)$
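
The cubic running time comes from the inner scans over k. A minimal sketch, assuming the gap penalty is passed in as a Python function `p` and that $D(0, 0) = 0$ (the function name and defaults are illustrative only):

```python
def align_general_gap(x, y, p, m=1, s=1):
    """Optimal score when a gap of length n costs p(n), for an arbitrary p."""
    nx, ny = len(x), len(y)
    neg_inf = float("-inf")
    D = [[neg_inf] * (ny + 1) for _ in range(nx + 1)]
    D[0][0] = 0
    for i in range(nx + 1):
        for j in range(ny + 1):
            if i == 0 and j == 0:
                continue
            best = neg_inf
            if i > 0 and j > 0:
                sub = m if x[i - 1] == y[j - 1] else -s
                best = D[i - 1][j - 1] + sub
            for k in range(i):   # gap of length i-k ending at x_i
                best = max(best, D[k][j] - p(i - k))
            for k in range(j):   # gap of length j-k ending at y_j
                best = max(best, D[i][k] - p(j - k))
            D[i][j] = best
    return D[nx][ny]


# One gap of length 2 (cost 3) beats two gaps of length 1 (cost 4).
print(align_general_gap("ACGT", "AT", lambda n: 1 + n))
```

Each of the O(N^2) cells scans O(N) earlier cells, which gives the O(N^3) bound on the slide.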

Affine Gaps
$p(n) = d + n \cdot e$
d = gap open penalty
e = gap extend penalty
[Figure: gap penalty diagram labeled with d and e.]
Now we need three matrices:
– $D(i, j)$ = score of alignment $x_1 \dots x_i$ to $y_1 \dots y_j$ if $x_i$ aligns to $y_j$
– $H(i, j)$ = score of alignment $x_1 \dots x_i$ to $y_1 \dots y_j$ if $y_j$ aligns to a gap
– $V(i, j)$ = score of alignment $x_1 \dots x_i$ to $y_1 \dots y_j$ if $x_i$ aligns to a gap

Needleman-Wunsch with Affine Gaps
$$D(i, j) = \max \begin{cases} D(i-1, j-1) + s(x_i, y_j) \\ H(i-1, j-1) + s(x_i, y_j) \\ V(i-1, j-1) + s(x_i, y_j) \end{cases}$$
$$H(i, j) = \max \begin{cases} D(i, j-1) - d \\ H(i, j-1) - e \\ V(i, j-1) - d \end{cases}$$
$$V(i, j) = \max \begin{cases} D(i-1, j) - d \\ H(i-1, j) - d \\ V(i-1, j) - e \end{cases}$$
Running time = $O(MN)$
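
A minimal sketch of the three-matrix recurrence follows. The function name `affine_align_score`, the default parameters, and the boundary handling ($D(0,0) = 0$, everything else starting at minus infinity) are assumptions. Note that, taken literally, these recurrences charge d for the first column of a gap and e for each further column, so a gap of length n costs $d + (n-1) \cdot e$ in this sketch.

```python
NEG_INF = float("-inf")


def affine_align_score(x, y, m=1, s=1, d=3, e=1):
    """Optimal global score using the matrices D, H, and V."""
    nx, ny = len(x), len(y)
    D = [[NEG_INF] * (ny + 1) for _ in range(nx + 1)]
    H = [[NEG_INF] * (ny + 1) for _ in range(nx + 1)]
    V = [[NEG_INF] * (ny + 1) for _ in range(nx + 1)]
    D[0][0] = 0  # assumed start state: the empty alignment
    for i in range(nx + 1):
        for j in range(ny + 1):
            if i > 0 and j > 0:
                # x_i aligns to y_j
                sub = m if x[i - 1] == y[j - 1] else -s
                D[i][j] = max(D[i - 1][j - 1], H[i - 1][j - 1], V[i - 1][j - 1]) + sub
            if j > 0:
                # y_j aligns to a gap: open after D or V, extend after H
                H[i][j] = max(D[i][j - 1] - d, H[i][j - 1] - e, V[i][j - 1] - d)
            if i > 0:
                # x_i aligns to a gap: open after D or H, extend after V
                V[i][j] = max(D[i - 1][j] - d, H[i - 1][j] - d, V[i - 1][j] - e)
    return max(D[nx][ny], H[nx][ny], V[nx][ny])


print(affine_align_score("ACGT", "AT"))
```

Each cell does constant work in each of the three tables, which matches the O(MN) running time.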

Affine Gaps
Essentially, when there is a gap, the algorithm looks back one column to determine whether this gap opens a new gap or continues a previous one:
  - x        z x        z x
  y -        y -        - -
  Starts     Starts     Continues
  new gap    new gap    old gap

Multiple Sequence Alignment
Given N sequences $x_1, x_2, \dots, x_N$, insert gaps in each sequence $x_i$ such that:
– All sequences have the same length L
– Global score is maximum
Motivation:
– Faint similarity between two sequences becomes significant if present in many
– Multiple alignments can help improve pairwise alignments

Induced Pairwise Alignments
Multiple alignment:
x: AC_GCGG_C
y: AC_GC_GAG
z: GCCGC_GAG
Induces three pairwise alignments (columns where both rows have a gap are dropped):
x: ACGCGG_C    x: AC_GCGG_C    y: AC_GCGAG
y: ACGC_GAG    z: GCCGC_GAG    z: GCCGCGAG

Sum of Pairs
The Sum of Pairs score of a multiple alignment is the sum of the scores of all induced pairwise alignments:
$$S(m) = \sum_{k < l} s(m_k, m_l)$$
where $s(m_k, m_l)$ = score of induced alignment $(k, l)$
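
A short sketch of the Sum of Pairs score under a linear gap penalty. The function names, the `-` gap character, and the parameter defaults are assumptions for illustration; columns in which both rows of a pair have a gap are skipped, which is what taking the induced pairwise alignment amounts to.

```python
from itertools import combinations


def pair_score(a, b, m=1, s=1, d=2):
    """Score of the pairwise alignment induced by two rows."""
    total = 0
    for p, q in zip(a, b):
        if p == "-" and q == "-":
            continue              # dropped from the induced alignment
        if p == "-" or q == "-":
            total -= d
        elif p == q:
            total += m
        else:
            total -= s
    return total


def sum_of_pairs(rows, m=1, s=1, d=2):
    """Sum the induced pairwise scores over all pairs k < l."""
    return sum(pair_score(a, b, m, s, d) for a, b in combinations(rows, 2))


rows = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
print(sum_of_pairs(rows))
```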

Multidimensional Dynamic Programming
Example in 3-D (3 sequences): 7 neighbors per cell
$$F(i, j, k) = \max \begin{cases} F(i-1, j-1, k-1) + S(x_i, x_j, x_k) \\ F(i-1, j-1, k) + S(x_i, x_j, -) \\ F(i-1, j, k-1) + S(x_i, -, x_k) \\ F(i-1, j, k) + S(x_i, -, -) \\ F(i, j-1, k-1) + S(-, x_j, x_k) \\ F(i, j-1, k) + S(-, x_j, -) \\ F(i, j, k-1) + S(-, -, x_k) \end{cases}$$

Multidimensional Dynamic Programming
L = length of each sequence
N = number of sequences
Size of matrix = $L^N$
Neighbors per cell = $2^N - 1$
Running time = $O(2^N L^N)$
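
The same recurrence can be written for any number of sequences by enumerating the $2^N - 1$ predecessor offsets. The sketch below is only practical for very small inputs; the name `msa_score`, the sum-of-pairs column score with a linear gap penalty, and the defaults are assumptions for illustration.

```python
from functools import lru_cache
from itertools import combinations, product


def msa_score(seqs, m=1, s=1, d=2):
    """Optimal sum-of-pairs score by N-dimensional dynamic programming."""
    n = len(seqs)
    # Every non-zero 0/1 vector is one neighbor: 1 = consume a letter, 0 = gap.
    steps = [step for step in product((0, 1), repeat=n) if any(step)]

    def column_score(col):
        total = 0
        for p, q in combinations(col, 2):
            if p == "-" and q == "-":
                continue
            if p == "-" or q == "-":
                total -= d
            elif p == q:
                total += m
            else:
                total -= s
        return total

    @lru_cache(maxsize=None)
    def F(idx):
        if not any(idx):
            return 0
        best = float("-inf")
        for step in steps:
            prev = tuple(i - t for i, t in zip(idx, step))
            if min(prev) < 0:
                continue  # this neighbor falls outside the matrix
            col = tuple(
                seq[i - 1] if t else "-" for seq, i, t in zip(seqs, idx, step)
            )
            best = max(best, F(prev) + column_score(col))
        return best

    return F(tuple(len(seq) for seq in seqs))


print(msa_score(["ACG", "ACG", "AG"]))
```

The memo table holds one entry per index tuple and each entry tries every step vector, which is where the $O(2^N L^N)$ bound comes from.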

Progressive Alignment
1. Align two of the sequences $x_i$ and $x_j$
2. Fix that alignment
3. Align a third sequence/alignment to the alignment $x_i x_j$
4. Repeat until all sequences are aligned

Progressive Alignment
When the evolutionary tree is known, align closest first, in order of the tree:
– Align (x, y)
– Align (w, z)
– Align (xy, wz)
[Figure: evolutionary tree with leaves y, z, x, w.]

Aligning three sequences
– Multidimensional Dynamic Programming
– Progressive Alignment
[Figure: two diagrams over the sequences X, Y, and Z, one for each method.]

Sequence vs Alignment
The score at each entry adds the score of aligning the letter of y to the column in the alignment xz.
[Figure: DP matrix with $y_1 \dots y_N$ on one axis and the alignment rows $x_1 \dots x_M$ and $z_1 \dots z_L$ on the other.]

Example
$i$th letter of y: A
$j$th column of xz: a gap in x, the letter A in z
$$D(i, j) = \max \begin{cases} D(i-1, j-1) - d + s(A, A) \\ D(i-1, j) - d - d \\ D(i, j-1) + 0 - d \end{cases}$$
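
The example generalizes to a DP in which every step is scored against a whole column. This is a sketch under a linear gap penalty only; the name `align_seq_to_alignment`, the defaults, and the boundary values are assumptions. A gap against a gap scores 0, as in the third case of the example.

```python
def align_seq_to_alignment(y, rows, m=1, s=1, d=2):
    """Align sequence y to a fixed alignment given as equal-length rows."""
    k, n, ny = len(rows), len(rows[0]), len(y)

    def pair(a, b):
        if a == "-" and b == "-":
            return 0
        if a == "-" or b == "-":
            return -d
        return m if a == b else -s

    def column(j):
        return [row[j - 1] for row in rows]

    D = [[0] * (n + 1) for _ in range(ny + 1)]
    for i in range(1, ny + 1):
        D[i][0] = D[i - 1][0] - k * d            # y_i against k gaps
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + sum(pair("-", c) for c in column(j))
    for i in range(1, ny + 1):
        for j in range(1, n + 1):
            col = column(j)
            D[i][j] = max(
                D[i - 1][j - 1] + sum(pair(y[i - 1], c) for c in col),
                D[i - 1][j] - k * d,             # y_i against an all-gap column
                D[i][j - 1] + sum(pair("-", c) for c in col),
            )
    return D[ny][n]


# The slide's example: y = A against the column (x: -, z: A).
print(align_seq_to_alignment("A", ["-", "A"]))
```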

Affine Gaps
$i$th letter of y matched with $j$th column of xz; $(j-1)$th column of xz gapped:
y: - A
x: - -
z: A A
This induces the yx alignment:
y: - A
x: - -

Affine Gaps
Recall that for pairwise alignment there were three cases when determining whether a gap starts or continues a gap:
  - x        z x        z x
  y -        y -        - -
  Starts     Starts     Continues
  new gap    new gap    old gap
When aligning a sequence and an alignment, a fourth case arises:
  - x
  - -
  Starts or continues a gap???
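
The ambiguity of the fourth case can be seen by labeling the gap columns of an induced pairwise alignment. In the sketch below (the function name and the `-` gap character are assumptions for illustration), columns where both rows have a gap are skipped, so the label of a gap column depends on the last column that was not skipped, which may be arbitrarily far back.

```python
def classify_gap_columns(a, b):
    """Label each gap column of the induced alignment as 'start' or 'continue'."""
    labels = []
    prev = None  # state of the last column kept in the induced alignment
    for col, (p, q) in enumerate(zip(a, b)):
        if p == "-" and q == "-":
            continue                  # not part of the induced alignment
        if p == "-":
            state = "gap_in_a"
        elif q == "-":
            state = "gap_in_b"
        else:
            state = None              # letter against letter
        if state is not None:
            labels.append((col, "continue" if state == prev else "start"))
        prev = state
    return labels


# The last two columns are identical in both calls, yet the labels differ.
print(classify_gap_columns("C-C", "---"))
print(classify_gap_columns("C-C", "A--"))
```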

Aligning Alignments
John D. Kececioglu and Weiqing Zhang, 1998
– Optimistic and pessimistic gap counts for sequence vs alignment
– Exact gap counts for sequence vs alignment

Sequence vs Alignment
$A = a_1 \dots a_m$ is a sequence of length m
B is a multiple alignment of length n of k sequences
– represented by a k x n matrix
– each entry $b_{ij}$ is either a letter or a gap

Optimistic and Pessimistic Gap Counts
When we have
  - x
  - -
– Optimistic gap count assumes that this continues a previous gap
– Pessimistic gap count assumes that this starts a new gap
Running time = $O(kmn)$

Exact Gap Counts
Recall matrices:
– $D(i, j)$ = score of alignment $a_1 \dots a_i$ to $b_1 \dots b_j$ if $a_i$ aligns to $b_j$
– $H(i, j)$ = score of alignment $a_1 \dots a_i$ to $b_1 \dots b_j$ if $b_j$ aligns to a gap
– $V(i, j)$ = score of alignment $a_1 \dots a_i$ to $b_1 \dots b_j$ if $a_i$ aligns to a gap
The only ways to get
  - x
  - -
are the cases HH, HV, and HD, generalized as HX.

Exact Gap Counts
Three possibilities:
– … DH…HX
– … VH…HX
– H………HX
Is $b_{ij}$ the first character in its row encountered during the run?
Algorithm with lots of matrices runs in $O(kn^2 + kmn + mn^2)$

Comparison
Sequence vs Alignment: only three types of paths can cause
  … - - x
  … - - -
Alignment vs Alignment: any path can cause
  … - - x
  … - - -

Aligning Alignments Exactly
John Kececioglu and Dean Starrett, 2003
– Aligning two alignments is NP-complete
– Exact algorithm
– Time and space complexity
– Pruning
– Results

NP-Completeness
Reduction from the Maximum Cut Problem
Still NP-complete if:
– Strings are of length at most 5
– Every row has at most 3 gaps
– At most 1 gap in the interior of each string

Exact Algorithm
It is sufficient to know the relative order of the rightmost element in the row for each pair:
x: - A
y: - -
If x’s rightmost element is to the right of y’s rightmost element, this is an extension.
Otherwise, it is a startup.

Shapes
A: -AGGCTATCACCTGACCTCCAGG
B: TAG-CTATCAC--GACCGC----
C: CAG-CTATCAC--GACCGC----
D: CAGCCTATCACC-GAACGCCA--
$S_1 = \{B, C\}$, $S_2 = \{D\}$, $S_3 = \{A\}$
$S = (S_1, S_2, S_3)$

Shapes
A shape s for an alignment with k rows is an ordered partition $s = (s_1, s_2, \dots, s_p)$ where $1 \le p \le k$.
If we know s, we know for each gap whether it starts or continues a gap.
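
One way to read the definition, following the four-row example in these slides, is to group rows by the column of their rightmost letter and order the groups from left to right. The sketch below does that and then applies the extension/startup rule for a pair of rows; the function names and the dictionary representation are assumptions for illustration, not the paper's data structures.

```python
def shape_of(alignment):
    """Ordered partition of row names by the column of each row's rightmost letter."""
    def rightmost(row):
        for col in range(len(row) - 1, -1, -1):
            if row[col] != "-":
                return col
        return -1  # a row with no letters at all

    groups = {}
    for name, row in alignment.items():
        groups.setdefault(rightmost(row), set()).add(name)
    return [groups[col] for col in sorted(groups)]


def continues_gap(shape, x, y):
    """True if a new column with a letter in row x and a gap in row y extends a gap."""
    rank = {name: r for r, block in enumerate(shape) for name in block}
    return rank[x] > rank[y]  # x's rightmost letter lies to the right of y's


alignment = {
    "A": "-AGGCTATCACCTGACCTCCAGG",
    "B": "TAG-CTATCAC--GACCGC----",
    "C": "CAG-CTATCAC--GACCGC----",
    "D": "CAGCCTATCACC-GAACGCCA--",
}
shape = shape_of(alignment)
print(shape)
print(continues_gap(shape, "A", "B"), continues_gap(shape, "B", "C"))
```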

Exact Algorithm
A is a k x m multiple alignment
B is an l x n multiple alignment
$C(i, j, s)$ = cost of an optimal alignment of $a_1 \dots a_i$ and $b_1 \dots b_j$ ending in shape s
Instead of entries $(i, j, s)$, think of entries $(i, j)$, each with a shape list $L(i, j)$

Exact Algorithm
[Figure: entry $C(i, j)$ and its three next-entries $C(i+1, j)$, $C(i, j+1)$, and $C(i+1, j+1)$.]

Exact Algorithm
For each s in $L(i, j)$:
– For each next-entry $(i, j+1)$, $(i+1, j)$, and $(i+1, j+1)$: add the resulting shape t to the next-entry’s shape list.
Find the s in $L(m, n)$ that minimizes $C(m, n, s)$ to find the optimum cost.
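
The step "add the resulting shape t" needs a rule for how a shape changes when one more column is appended. Under the reading of shapes used in these slides (rows ordered by the position of their rightmost letter), the rows that receive a letter in the new column all move to a single last block, and the other rows keep their relative order. The sketch below implements only that update; the function name and the set-based representation are assumptions for illustration, and the cost bookkeeping of the full algorithm is not shown.

```python
def extend_shape(shape, rows_with_letters):
    """Shape after appending a column that has letters in the given rows."""
    moved = set(rows_with_letters)
    if not moved:
        return [set(block) for block in shape]   # an all-gap column changes nothing
    # Rows without a letter keep their order; emptied blocks disappear.
    remaining = [block - moved for block in shape]
    return [block for block in remaining if block] + [moved]


shape = [{"B", "C"}, {"D"}, {"A"}]
print(extend_shape(shape, {"B", "D"}))
```

Several predecessor shapes can lead to the same shape t, which is why each entry keeps a list of distinct shapes rather than one per path.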

Time and Space Complexity
Time = $O((3 + \sqrt{2})^k (n-k)^2 k^{3/2})$, if $k < n$
Time = $O((3 + \sqrt{2})^n k^2 n^{-1/2})$, if $k \ge n$
Space = Time / k

Pruning
– Dominance Pruning: uses a dominance relation on pairs of shapes
– Bound Pruning: exploits upper and lower bounds on the cost of an optimal alignment
Combining these yields the fastest exact algorithm in practice.

Dominance Pruning
Extension: a series of insertions, deletions, and substitutions of columns that extend the alignment into an entry
Shape s dominates shape t if, for all extensions p, $C(s \circ p) \le C(t \circ p)$
That is, s is at least as good as t on all extensions.

Bound Pruning
$L(s)$: lower bound on $C(s \circ p)$ for all p
– Optimistic algorithm on reverse of input
U: upper bound on the cost of the optimal alignment of A and B
– Minimum of optimistic, pessimistic, and trivial alignment scores
If $L(s) > U$, remove s.

Reducing the Space
The exact algorithm with dominance pruning can be run in space linear in the number of columns of the input, without increasing the time complexity.
This is not possible with bound pruning, which uses quadratic-size tables to look up lower bounds.

Results
– Tractable in practice
– Ceiling phenomenon: the number of shapes does not grow once the number of rows exceeds a threshold
