# Sequence Alignment Arthur W. Chou Tunghai University Fall 2005.


## Sequence Alignment

Input: two sequences over the same alphabet.

Output: an alignment of the two sequences.

Example:

    GCGCATGGATTGAGCGA
    TGCGCCATTGATGACCA

A possible alignment:

    -GCGC-ATGGATTGAGCGA
    TGCGCCATTGAT-GACC-A

## Why align sequences?

Lots of sequences don't have known ancestry, structure, or function. A few of them do.

- If they align, they are similar.
- If they are similar, they might have the same ancestry, or similar structure or function.
- If one of them has known ancestry, structure, or function, then alignment to the others yields insight about them.

## Alignments

    -GCGC-ATGGATTGAGCGA
    TGCGCCATTGAT-GACC-A

Three kinds of aligned position:

- Exact matches
- Mismatches
- Indels (gaps)

## Choosing Alignments

There are many possible alignments. For example, compare

    -GCGC-ATGGATTGAGCGA
    TGCGCCATTGAT-GACC-A

to

    ------GCGCATGGATTGAGCGA
    TGCGCC----ATTGATGACCA--

Which one is better?

## Scoring Alignments

Similar sequences evolved from a common ancestor. Evolution changed the sequences from this ancestral sequence by mutations:

- Replacement: one letter replaced by another
- Deletion: deletion of a character
- Insertion: insertion of a character

Scoring of sequence similarity should examine how many and which operations took place.

## Simple Scoring Rule

Score each position independently:

- Match: +1
- Mismatch: -1
- Indel: -2

The score of an alignment is the sum of the position scores.

## Example

    -GCGC-ATGGATTGAGCGA
    TGCGCCATTGAT-GACC-A

Score: (+1 × 13) + (-1 × 2) + (-2 × 4) = 3

    ------GCGCATGGATTGAGCGA
    TGCGCC----ATTGATGACCA--

Score: (+1 × 5) + (-1 × 6) + (-2 × 12) = -25
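The simple scoring rule can be checked mechanically. The following is a minimal Python sketch, not part of the original slides; the function name `score_alignment` and its keyword parameters are illustrative assumptions, with defaults taken from the +1 / -1 / -2 rule.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, indel=-2):
    """Sum the per-column scores of a gapped alignment given as two rows."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two gaps")
        if x == "-" or y == "-":
            total += indel      # indel (gap) column
        elif x == y:
            total += match      # exact match
        else:
            total += mismatch   # mismatch
    return total


print(score_alignment("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A"))  # 3
print(score_alignment("------GCGCATGGATTGAGCGA", "TGCGCC----ATTGATGACCA--"))
```

Because both input sequences have 17 letters, every alignment of them has the same number of gaps in each row, so the indel count is always even.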

## More General Scores

The choice of +1, -1, and -2 scores is quite arbitrary. Depending on the context, some changes are more plausible than others:

- Exchange of an amino acid by one with similar properties (size, charge, etc.), vs.
- Exchange of an amino acid by one with opposite properties

Probabilistic interpretation: how likely is one alignment versus another?

## Dot Matrix Method

- A dot is placed at each position where two residues match.
- It's a visual aid. The human eye can rapidly identify similar regions in sequences.
- It's a good way to explore sequence organization, e.g. sequence repeats.
- It does not provide an alignment.

For comparison, an alignment of the same two sequences:

    THEFA-TCAT
    ||||| ||||
    THEFASTCAT

Applied residue by residue, this method produces dot-plots with too much noise to be useful.

- The noise can be reduced by calculating a score using a window of residues.
- The score is compared to a threshold or stringency.

## Dot Matrix Representation

- Produces a graphical representation of similarity regions
- The horizontal and vertical dimensions correspond to the compared sequences
- A region of similarity stands out as a diagonal

[Figure: dot plot of tissue-type plasminogen activator against urokinase-type plasminogen activator]

## Dot Matrix or Dot-plot

- Each window of the first sequence is aligned (without gaps) to each window of the second sequence.
- A colour is set in a rectangular array according to the score of the aligned windows.

| Window 1 | Window 2 | Score |
|---|---|---|
| THE | THE | 23 |
| THE | HEF | -5 |
| CAT | THE | -4 |
| HEF | THE | -5 |
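The windowed dot plot can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the slides' method: the names `dot_plot` and `render` are hypothetical, and each window is scored by counting identical residues rather than with the substitution matrix that produced the scores 23, -5, and -4 above.

```python
def dot_plot(seq1, seq2, window=3, threshold=3):
    """Return the set of (i, j) where the ungapped windows starting at
    seq1[i] and seq2[j] contain at least `threshold` identical residues."""
    dots = set()
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            w1 = seq1[i:i + window]
            w2 = seq2[j:j + window]
            score = sum(a == b for a, b in zip(w1, w2))  # identity scoring (assumption)
            if score >= threshold:
                dots.add((i, j))
    return dots


def render(seq1, seq2, dots):
    """Draw the plot as text: rows follow seq1, columns follow seq2."""
    lines = ["  " + seq2]
    for i, ch in enumerate(seq1):
        row = "".join("*" if (i, j) in dots else "." for j in range(len(seq2)))
        lines.append(ch + " " + row)
    return "\n".join(lines)


dots = dot_plot("THEFATCAT", "THEFASTCAT")
print(sorted(dots))  # [(0, 0), (1, 1), (2, 2), (5, 6), (6, 7)]
print(render("THEFATCAT", "THEFASTCAT", dots))
```

The dots fall on two diagonal runs offset by one column, which is how the single-residue insertion in THEFASTCAT shows up in a dot plot.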

## Dot Matrix Display

- Diagonal rows of dots reveal sequence similarity or repeats.
- Anti-diagonal rows of dots represent inverted repeats.
- Isolated dots represent random similarity.

[Figure: example dot plot; axis labels H C G E T F G R W F T P E W and K C G P T F G R I A C G E M]

## Dot matrix web server

http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html

The plot can be filtered with a sliding window that looks for longer strings of matches and eliminates random matches.

## Longest Common Subsequence

Sequence A: nematode_knowledge

Sequence B: empty_bottle

    n e m a t o d e _ k n o w l e d g e
    e m p t y _ b o t t l e

LCS is equivalent to alignment with match score 1, mismatch score 0, and gap penalty 0.

## What is an algorithm?

A step-by-step description of the procedures to accomplish a task.

Properties:

1. Determination of output for each input
2. Generality
3. Termination

Criteria:

1. Correctness (proof, test, etc.)
2. Time efficiency (number of steps is small)
3. Space efficiency (space used is small)

## Naïve algorithm: exhaustive search

    G C G A A T G G A T T G A G C G T
    T G A G C C A T T G A T G A C C A

With index i into the first sequence and index j into the second, exhaustive search tries every sequence of choices between advancing i and advancing j. For sequences of length n there are up to 2n such choices, so the worst-case time complexity is about $2^{2n}$.

## Dynamic programming algorithms for pairwise sequence alignment

- Similar to Longest Common Subsequence
- Introduced for biological sequences by S. B. Needleman & C. D. Wunsch, "A general method applicable to the search for similarities in the amino acid sequence of two proteins," J. Mol. Biol. 48:443-453 (1970)

## Dynamic Programming

- Optimal substructure
- Reduction to a "small" number of sub-problems
- Memoization of solutions to sub-problems in a table
- Table look-up and tracing

Optimal substructure, illustrated on the earlier alignment:

    - G C G C - A T G G A T T G A G C G A
    T G C G C C A T T G A T - G A C C - A

## Recursive LCS

    int lcs_len(int i, int j)
    {
        if (A[i] == '\0' || B[j] == '\0')
            return 0;
        else if (A[i] == B[j])
            return 1 + lcs_len(i+1, j+1);
        else
            return max(lcs_len(i+1, j), lcs_len(i, j+1));
    }

lcs_len(i, j): length of the LCS from the i-th position onward in string A and from the j-th position onward in string B.

## Reduction to Subproblems

    int lcs_len(String A, String B)
    {
        return subproblem(0, 0);
    }

    int subproblem(int i, int j)
    {
        if (A[i] == '\0' || B[j] == '\0')
            return 0;
        else if (A[i] == B[j])
            return 1 + subproblem(i+1, j+1);
        else
            return max(subproblem(i+1, j), subproblem(i, j+1));
    }

## Memoizing the solutions

    Matrix L[i, j] = -1;    // initialize every entry of the memo table to "not yet computed"

    int subproblem(int i, int j)
    {
        if (L[i, j] < 0) {
            if (A[i] == '\0' || B[j] == '\0')
                L[i, j] = 0;
            else if (A[i] == B[j])
                L[i, j] = 1 + subproblem(i+1, j+1);
            else
                L[i, j] = max(subproblem(i+1, j), subproblem(i, j+1));
        }
        return L[i, j];
    }
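To see what the memo table buys, the plain recursion and the memoized one can be run side by side while counting how often the function body executes. This is a hedged Python sketch; the names `lcs_len_naive` and `lcs_len_memo` are illustrative, and `functools.lru_cache` stands in for the matrix `L`.

```python
from functools import lru_cache


def lcs_len_naive(a, b):
    """Plain recursion; returns (LCS length, number of calls made)."""
    calls = 0

    def sub(i, j):
        nonlocal calls
        calls += 1
        if i == len(a) or j == len(b):
            return 0
        if a[i] == b[j]:
            return 1 + sub(i + 1, j + 1)
        return max(sub(i + 1, j), sub(i, j + 1))

    return sub(0, 0), calls


def lcs_len_memo(a, b):
    """Same recursion with a cache; the body runs at most once per (i, j)."""
    calls = 0

    @lru_cache(maxsize=None)
    def sub(i, j):
        nonlocal calls
        calls += 1  # counted only on a cache miss
        if i == len(a) or j == len(b):
            return 0
        if a[i] == b[j]:
            return 1 + sub(i + 1, j + 1)
        return max(sub(i + 1, j), sub(i, j + 1))

    return sub(0, 0), calls


# Two strings with no letter in common force the worst-case branching.
print(lcs_len_naive("abcdefgh", "ijklmnop"))
print(lcs_len_memo("abcdefgh", "ijklmnop"))
```

With the cache, the number of evaluated sub-problems is bounded by (m+1)(n+1) for strings of lengths m and n, while the plain recursion revisits the same (i, j) pairs many times.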

## Iterative LCS: Table Look-up

To get the length of the LCS of A and B:

    {
        first allocate storage for the matrix L;
        for each row i from m downto 0
            for each column j from n downto 0
                if (A[i] == '\0' or B[j] == '\0') L[i, j] = 0;
                else if (A[i] == B[j]) L[i, j] = 1 + L[i+1, j+1];
                else L[i, j] = max(L[i+1, j], L[i, j+1]);
        return L[0, 0];
    }

## Iterative LCS: Table Look-up

    int lcs_len(String A, String B)    // find the length
    {
        // First allocate storage for the matrix L;
        for (i = m; i >= 0; i--)          // A has length m+1, counting the '\0' terminator
            for (j = n; j >= 0; j--) {    // B has length n+1, counting the '\0' terminator
                if (A[i] == '\0' || B[j] == '\0') L[i, j] = 0;
                else if (A[i] == B[j]) L[i, j] = 1 + L[i+1, j+1];
                else L[i, j] = max(L[i+1, j], L[i, j+1]);
            }
        return L[0, 0];
    }

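## Dynamic Programming Algorithm

$$
L[i, j] =
\begin{cases}
1 + L[i+1, j+1] & \text{if } A[i] = B[j] \\
\max\left( L[i+1, j],\; L[i, j+1] \right) & \text{otherwise}
\end{cases}
$$

[Figure: matrix L, with rows i and i+1 indexed by A and columns j and j+1 indexed by B, showing that cell L[i, j] depends on L[i, j+1], L[i+1, j], and L[i+1, j+1]]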

n e m a t o d e _ k n o w l e d g e e 7 7 6 5 5 5 5 5 4 3 3 3 2 2 2 1 1 1 0 m 6 6 6 5 5 4 4 4 4 3 3 3 2 2 1 1 1 1 0 p 5 5 5 5 5 4 4 4 4 3 3 3 2 2 1 1 1 1 0 t 5 5 5 5 5 4 4 4 4 3 3 3 2 2 1 1 1 1 0 y 4 4 4 4 4 4 4 4 4 3 3 3 2 2 1 1 1 1 0 _ 4 4 4 4 4 4 4 4 4 3 3 3 2 2 1 1 1 1 0 b 3 3 3 3 3 3 3 3 3 3 3 3 2 2 1 1 1 1 0 o 3 3 3 3 3 3 3 3 3 3 3 3 2 2 1 1 1 1 0 t 3 3 3 3 3 2 2 2 2 2 2 2 2 2 1 1 1 1 0 t 3 3 3 3 3 2 2 2 2 2 2 2 2 2 1 1 1 1 0 l 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 0 e 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

## Obtain the subsequence

    Sequence S = empty;    // the LCS
    i = 0; j = 0;
    while (i < m && j < n) {
        if (A[i] == B[j]) {
            add A[i] to end of S;
            i++; j++;
        }
        else if (L[i+1, j] >= L[i, j+1]) i++;
        else j++;
    }
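The table fill and the trace described above fit together as follows. This is a minimal Python sketch of the same procedure; the function name `lcs` is an illustrative choice, and Python string lengths replace the `'\0'` terminator test.

```python
def lcs(a, b):
    """Return one longest common subsequence of a and b."""
    m, n = len(a), len(b)
    # L[i][j] = length of the LCS of a[i:] and b[j:].
    # Row m and column n stay 0 (one string is exhausted).
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                L[i][j] = 1 + L[i + 1][j + 1]
            else:
                L[i][j] = max(L[i + 1][j], L[i][j + 1])

    # Trace from the top-left corner, following the larger neighbour.
    out = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif L[i + 1][j] >= L[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)


print(lcs("nematode_knowledge", "empty_bottle"))  # emt_ole
```

Filling the table takes time and space proportional to m × n, and the trace takes at most m + n steps.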

[Figure: the LCS table drawn as a grid graph, with nematode_knowledge along the columns and empty_bottle along the rows; horizontal and vertical edges join neighbouring cells, and diagonal edges mark the positions where the two letters match]

## Dynamic Programming with scores and penalties

[Figure: sequences VADL…KTA and NAKM…DALT, with positions i and j marked and gaps of length x and y]

## Dynamic Programming with scores and penalties

From the i-th position in A and the j-th position in B onward:

$$
S[i, j] = \max
\begin{cases}
s(A[i], B[j]) + S[i+1, j+1] \\
\max_{x \ge 1} \{\, S[i+x, j] - w(x) \,\} & \text{(gap of length } x \text{ in sequence } B\text{)} \\
\max_{y \ge 1} \{\, S[i, j+y] - w(y) \,\} & \text{(gap of length } y \text{ in sequence } A\text{)}
\end{cases}
$$

Here $S[i, j]$ is the best score from $i$, $j$ onward, $w$ is the penalty function, and $s$ is the score.

## Algorithm for simple gap penalty

If the penalty for each gap position is a fixed constant $c$, then

$$
S[i, j] = \max
\begin{cases}
s(A[i], B[j]) + S[i+1, j+1] \\
S[i+1, j] - c & \text{(one gap)} \\
S[i, j+1] - c & \text{(one gap)}
\end{cases}
$$
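The simple-gap recurrence can be turned into a short table-filling routine. The sketch below is an illustration, not the slides' code: the name `global_score`, the +1 / -1 substitution scores, and the default c = 2 are assumptions borrowed from the simple scoring rule, and the boundary rows (end gaps charged c per position) are an assumption the slide does not state.

```python
def global_score(a, b, match=1, mismatch=-1, c=2):
    """Best global alignment score of a and b with a fixed penalty c per gap position."""
    m, n = len(a), len(b)
    # S[i][j] = best score for aligning a[i:] with b[j:].
    S = [[0] * (n + 1) for _ in range(m + 1)]
    # Boundary (assumption): leftover residues are aligned to gaps at cost c each.
    for i in range(m - 1, -1, -1):
        S[i][n] = S[i + 1][n] - c
    for j in range(n - 1, -1, -1):
        S[m][j] = S[m][j + 1] - c
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            s = match if a[i] == b[j] else mismatch
            S[i][j] = max(
                s + S[i + 1][j + 1],  # align a[i] with b[j]
                S[i + 1][j] - c,      # a[i] opposite a gap
                S[i][j + 1] - c,      # b[j] opposite a gap
            )
    return S[0][0]


print(global_score("GCGCATGGATTGAGCGA", "TGCGCCATTGATGACCA"))
```

Since the first hand-made alignment of these two sequences scores 3 under the same rule, the optimal score printed here cannot be lower than 3.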

## Table Tracing

To do table tracing based on a similarity matrix of amino acids, we re-define S[i, j] to be the optimal score of choosing the match of A[i] with B[j]:

$$
S[i, j] = s(A[i], B[j]) + \max
\begin{cases}
S[i+1, j+1] \\
\max_{x \ge 1} \{\, S[i+1+x, j+1] - w(x) \,\} & \text{(gap of length } x \text{ in sequence } B\text{)} \\
\max_{y \ge 1} \{\, S[i+1, j+1+y] - w(y) \,\} & \text{(gap of length } y \text{ in sequence } A\text{)}
\end{cases}
$$

Here $s$ is the score and $w$ is the gap penalty.

## Diagram

[Figure: matrix S, showing cell (i, j) with score s[i, j] and its diagonal neighbour S[i+1, j+1]]

## Summation operation

1. Start at the lower right corner.
2. Move diagonally up one position.
3. Find the largest value in either
   - the row segment starting diagonally below the current position and extending to the right, or
   - the column segment starting diagonally below the current position and extending down.
4. Add this value to the value in the current cell.
5. Repeat steps 3 and 4 for all cells to the left in the current row and all cells above in the current column.
6. If we are not in the top left corner, go to step 2.

[Figure: worked example matrices; surviving sequence labels: VHGQKV and VAHGQKVA]

## Use of dynamic programming to evaluate homology between pairs of sequences

If we just want to know the maximum match possible between two sequences, we don't need to do trace-back; we can just look at the highest value in the first row or column (the "match score"). This represents the best possible alignment score.
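The summation operation and the match-score readout can be sketched together in Python. This is a hedged illustration: the names `nw_summation` and `match_score` are hypothetical, `s` and `w` are caller-supplied functions, and the fill order (row by row from the bottom) differs from the diagonal sweep in the numbered steps while producing the same cell values.

```python
def nw_summation(a, b, s, w):
    """Fill S so that S[i][j] is the best score of an alignment that pairs
    a[i] with b[j] and continues to the right and downward."""
    m, n = len(a), len(b)
    S = [[s(a[i], b[j]) for j in range(n)] for i in range(m)]
    # Cells in the last row and last column keep their raw score s.
    for i in range(m - 2, -1, -1):
        for j in range(n - 2, -1, -1):
            best = S[i + 1][j + 1]
            # Column segment below the diagonal neighbour: gap of length x in b.
            for x in range(1, m - i - 1):
                best = max(best, S[i + 1 + x][j + 1] - w(x))
            # Row segment right of the diagonal neighbour: gap of length y in a.
            for y in range(1, n - j - 1):
                best = max(best, S[i + 1][j + 1 + y] - w(y))
            S[i][j] += best
    return S


def match_score(S):
    """Highest value in the first row or first column."""
    return max(max(S[0]), max(row[0] for row in S))


identity = lambda p, q: 1 if p == q else 0   # match 1, mismatch 0
no_penalty = lambda gap_length: 0            # gap penalty 0
S = nw_summation("nematode_knowledge", "empty_bottle", identity, no_penalty)
print(match_score(S))  # 7
```

With match score 1, mismatch score 0, and gap penalty 0, the match score equals the LCS length of the two strings. The inner loops make this version cubic in the sequence length.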

## Gap penalty alternatives

- Constant gap penalty for gap > 1
- Gap penalty that grows with gap size (affine gap penalty):
  - one penalty for starting a gap (gap opening penalty)
  - a different (lower) penalty for adding to a gap (gap extension penalty)
  - the dynamic programming algorithm can be made more efficient
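As a worked formula, one common way to write an affine penalty for a gap of length $x$ is shown below. The symbols $g_{\text{open}}$ and $g_{\text{ext}}$ are illustrative names not used in the slides, and some programs charge the extension penalty for the first position as well, so this is one convention among several.

$$
w(x) = g_{\text{open}} + (x - 1)\, g_{\text{ext}}, \qquad x \ge 1, \quad g_{\text{ext}} < g_{\text{open}}
$$

For example, with $g_{\text{open}} = 10$ and $g_{\text{ext}} = 1$, a single gap of length 4 costs $10 + 3 \times 1 = 13$, while four separate gaps of length 1 cost $4 \times 10 = 40$, so the scheme favours a few long gaps over many short ones.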

## Gap penalty alternatives (cont.)

- Gap penalty proportional to gap size and sequence:
  - for nucleic acids, can be used to mimic the thermodynamics of helix formation
  - two kinds of gap opening penalties: one for a gap closed by AT, a different one for GC
  - different gap extension penalty

## End gaps

Some programs treat end gaps as normal gaps and apply penalties; other programs do not apply penalties for end gaps.

## End gaps (cont.)

- You can determine which a program does by adding extra (unmatched) bases to the end of one sequence and seeing if the match score changes.
- Penalties for end gaps are appropriate for aligned sequences where the ends "should match".
- Penalties for end gaps are inappropriate when surrounding sequences are expected to be different (e.g., a conserved exon surrounded by varying introns).

## Global vs. Local Similarity

Should the result of the alignment include all residues (amino acids or nucleotides) of the sequences, or just those that match?

- If all, a global alignment is desired.
- If just those that match, a local alignment is desired.
- Local alignment is accomplished by including negative scores for "mismatched" positions, so scores get worse as we move away from the region of match.
- Instead of starting trace-back with the highest value in the first row or column, start with the highest value in the entire matrix, and stop when the score hits zero.

## Local Alignment

From the i-th position in A and the j-th position in B onward:

$$
H[i, j] = \max
\begin{cases}
0 \\
s(A[i], B[j]) + H[i+1, j+1] \\
\max_{x \ge 1} \{\, H[i+x, j] - w(x) \,\} & \text{(gap of length } x \text{ in sequence } B\text{)} \\
\max_{y \ge 1} \{\, H[i, j+y] - w(y) \,\} & \text{(gap of length } y \text{ in sequence } A\text{)}
\end{cases}
$$

Here $w$ is the penalty function, $s$ is the score, and $H[i, j]$ is the best score of any prefix of the subsequence from $i$, $j$ onward.
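A local-alignment score under this recurrence can be sketched in Python as follows. The function name `local_score` and the +1 / -1 substitution scores are illustrative assumptions, and the general penalty function w is replaced by a fixed penalty c per gap position, as in the simple gap penalty case.

```python
def local_score(a, b, match=1, mismatch=-1, c=2):
    """Best local alignment score of a and b with a fixed penalty c per gap position."""
    m, n = len(a), len(b)
    # H[i][j] = best score of a local alignment starting at a[i], b[j];
    # the 0 option lets an alignment end as soon as extending it stops paying off.
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            s = match if a[i] == b[j] else mismatch
            H[i][j] = max(
                0,
                s + H[i + 1][j + 1],  # align a[i] with b[j]
                H[i + 1][j] - c,      # a[i] opposite a gap
                H[i][j + 1] - c,      # b[j] opposite a gap
            )
            best = max(best, H[i][j])  # highest value in the entire matrix
    return best


print(local_score("XXXGATTACAYYY", "ZZGATTACAWW"))  # 7
```

The shared region GATTACA scores 7, and the unrelated flanking letters contribute nothing because the score is never allowed to drop below zero.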