 # Developing Pairwise Sequence Alignment Algorithms

## Presentation on theme: "Developing Pairwise Sequence Alignment Algorithms"— Presentation transcript:

Developing Pairwise Sequence Alignment Algorithms
Dr. Nancy Warter-Perez

Developing Pairwise Sequence Alignment Algorithms
Outline Overview of global and local alignment References for sequence alignment algorithms Discussion of Needleman-Wunsch iterative approach to global alignment Discussion of Smith-Waterman recursive approach to local alignment Discussion of how LCS Algorithm can be extended for Global alignment (Needleman-Wunsch) Local alignment (Smith-Waterman) Group assignments for project Developing Pairwise Sequence Alignment Algorithms

Overview of Pairwise Sequence Alignment
Dynamic Programming Applied to optimization problems Useful when Problem can be recursively divided into sub-problems Sub-problems are not independent Needleman-Wunsch is a global alignment technique that uses an iterative algorithm and no gap penalty (could extend to fixed gap penalty). Smith-Waterman is a local alignment technique that uses a recursive algorithm and can use alternative gap penalties (such as affine). Smith-Waterman’s algorithm is an extension of Longest Common Substring (LCS) problem and can be generalized to solve both local and global alignment. Note: Needleman-Wunsch is usually used to refer to global alignment regardless of the algorithm used. Developing Pairwise Sequence Alignment Algorithms

Developing Pairwise Sequence Alignment Algorithms
References An Introduction to Bioinformatics Algorithms (Computational Molecular Biology) Neil C. Jones, Pavel Pevzner Computational Molecular Biology – An Algorithmic Approach, Pavel Pevzner Introduction to Computational Biology – Maps, sequences, and genomes, Michael Waterman Algorithms on Strings, Trees, and Sequences – Computer Science and Computational Biology, Dan Gusfield Developing Pairwise Sequence Alignment Algorithms

Developing Pairwise Sequence Alignment Algorithms
Classic Papers Needleman, S.B. and Wunsch, C.D. A General Method Applicable to the Search for Similarities in Amino Acid Sequence of Two Proteins. J. Mol. Biol., 48, pp , ( Smith, T.F. and Waterman, M.S. Identification of Common Molecular Subsequences. J. Mol. Biol., 147, pp , 1981.( Developing Pairwise Sequence Alignment Algorithms

Why search sequence databases?
I have just sequenced something. What is known about the thing I sequenced? I have a unique sequence. Is there similarity to another gene that has a known function? I found a new protein sequence in a lower organism. Is it similar to a protein from another species? Developing Pairwise Sequence Alignment Algorithms

Global Alignment Method
Output: An alignment of two sequences is represented by three lines The first line shows the first sequence The third line shows the second sequence. The second line has a row of symbols. The symbol is a vertical bar wherever characters in the two sequences match, and a space where ever they do not. Dots (or dashes) may be inserted in either sequence to represent gaps. Developing Pairwise Sequence Alignment Algorithms

Global Alignment Method (cont. 1)
For example, the two hypothetical sequences abcdefghajklm abbdhijk could be aligned like this || | | || abbd...hijk As shown, there are 6 matches, 2 mismatches, and one gap of length 3. Developing Pairwise Sequence Alignment Algorithms

Global Alignment Method (cont. 2)
The alignment can be scored according to a payoff matrix for fixed scoring. You can also use PAM or BLOSUM. \$payoff = {match => \$match, mismatch => \$mismatch, gap_open => \$gap_open, gap_extend => \$gap_extend}; For correct operation, an algorithm is created such that the match must be positive and the other payoff entities must be negative. Developing Pairwise Sequence Alignment Algorithms

Global Alignment Method (cont. 3)
Example Given the payoff matrix \$payoff = {match => 4, mismatch => -3, gap_open => -2, gap_extend => -1}; What is the alignment and what is the alignment score for the Following two sequences? Sequence 1: abcdefghajklm Sequence 2: abbdhijk Developing Pairwise Sequence Alignment Algorithms

Global Alignment Method (cont. 4)
The sequences abcdefghajklm abbdhijk are aligned and scored like this a b c d e f g h a j k l m | | | | | | a b b d h i j k match mismatch gap_open gap_extend for a total score of = 13. Developing Pairwise Sequence Alignment Algorithms

Global Alignment Method (cont. 5)
The algorithm should guarantee that no other alignment of these two sequences has a higher score under this payoff matrix. Developing Pairwise Sequence Alignment Algorithms

Three steps in Dynamic Programming
1. Initialization 2. Matrix fill or scoring 3. Traceback and alignment Developing Pairwise Sequence Alignment Algorithms

Developing Pairwise Sequence Alignment Algorithms
Ends-Free Global Alignment with Fixed Scoring Two sequences will be aligned. GGATCGA (sequence #1 = V) GAATTCAGTTA (sequence #2 = W) A simple fixed scoring scheme will be used (vi, wj) = 1 if the residue at position i of sequence #1 (vi) is the same as the residue at position j of the sequence #2 (wj) – called match score (vi, wj) = 0 if vi ≠ wj – called mismatch score (vi, -) = (-, wj) = 0 – called gap penalty Developing Pairwise Sequence Alignment Algorithms

Developing Pairwise Sequence Alignment Algorithms
Matrix fill step: Each position Si,j is defined to be the MAXIMUM score at position i,j Si,j = MAXIMUM [ Si-1, j-1 + (vi, wj) (match or mismatch in the diagonal) Si, j-1 + w (gap in sequence #1 = V) Si-1, j + w (gap in sequence #2 = W)] column row Developing Pairwise Sequence Alignment Algorithms

Developing Pairwise Sequence Alignment Algorithms
Initialization step: 1) Create Matrix with m + 1 columns and n + 1 rows, where n = number of letters in sequence 1 and m = number of letters in sequence 2. 2) First column and first row will be filled with 0’s (for ends-free alignment). Developing Pairwise Sequence Alignment Algorithms

Developing Pairwise Sequence Alignment Algorithms
Scoring Step: Fill in row by row: Developing Pairwise Sequence Alignment Algorithms

Developing Pairwise Sequence Alignment Algorithms
Traceback Step: Seq#1 G G A - T C - G - - A | | | | | | Seq#2 G A A T T C A G T T A Developing Pairwise Sequence Alignment Algorithms

Global Alignment output file
Global: HBA_HUMAN vs HBB_HUMAN Score: HBA_HUMAN VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFP 44 |:| :|: | | |||| : | | ||| |: : :| |: :| HBB_HUMAN VHLTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFE 43 HBA_HUMAN HF.DLS.....HGSAQVKGHGKKVADALTNAVAHVDDMPNALSAL 83 | ||| |: :|| ||||| | :: :||:|:: : | HBB_HUMAN SFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATL 88 HBA_HUMAN SDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKF 128 |:|| || ||| ||:|| : |: || | |||| | |: | HBB_HUMAN SELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKV 133 HBA_HUMAN LASVSTVLTSKYR :| |: | || HBB_HUMAN VAGVANALAHKYH %id = %similarity = (88/139 *100) Overall %id = 43.15; Overall %similarity = (88/146 *100) Developing Pairwise Sequence Alignment Algorithms

Developing Pairwise Sequence Alignment Algorithms
LCS Problem (review) Similarity score si-1,j si,j = max { si,j-1 si-1,j-1 + 1, if vi = wj Developing Pairwise Sequence Alignment Algorithms

Extend LCS to Global Alignment
si-1,j + (vi, -) si,j = max { si,j-1 + (-, wj) si-1,j-1 + (vi, wj) (vi, -) = (-, wj) = - = fixed gap penalty (vi, wj) = score for match or mismatch – can be fixed, from PAM or BLOSUM Developing Pairwise Sequence Alignment Algorithms

Global Alignment Alternatives
Ends-free alignment – don’t penalize gaps at the beginning or end Initialize first row and column of S to 0 Search last row and column for maximum score Regular global alignment – score end to end (penalize gaps at beginning and end) Initialize first row and column of S with gap penalty Alignment score is in the lower right corner of S Developing Pairwise Sequence Alignment Algorithms

Historical Perspective: Needleman-Wunsch (1 of 3)
Match = 1 Mismatch = 0 Gap = 0 Developing Pairwise Sequence Alignment Algorithms

Historical Perspective: Needleman-Wunsch (2 of 3)
Developing Pairwise Sequence Alignment Algorithms

Historical Perspective: Needleman-Wunsch (3 of 3)
From page 446: It is apparent that the above array operation can begin at any of a number of points along the borders of the array, which is equivalent to a comparison of N-terminal residues or C-terminal residues only. As long as the appropriate rules for pathways are followed, the maximum match will be the same. The cells of the array which contributed to the maximum match, may be determined by recording the origin of the number that was added to each cell when the array was operated upon. Developing Pairwise Sequence Alignment Algorithms

Developing Pairwise Sequence Alignment Algorithms
Smith-Waterman Algorithm Advances in Applied Mathematics, 2: (1981) The Smith-Waterman algorithm is a local alignment tool used to obtain sensitive pairwise similarity alignments. Smith-Waterman algorithm uses dynamic programming. It selects the optimal path as the highest ranked alignment. Smith-Waterman algorithm is useful for finding local areas of similarity between sequences that are too dissimilar for global alignment. The S-W algorithm uses a lot of computer memory. BLAST and FASTA are other search algorithms that use some aspects of S-W. Developing Pairwise Sequence Alignment Algorithms

Smith-Waterman (cont. 1)
a. It searches for sequence matches. b. Assigns a score to each pair of amino acids -uses similarity scores -uses positive scores for related residues -uses negative scores for substitutions and gaps c. Initializes edges of the matrix with zeros d. As the scores are summed in the matrix, any sum below 0 is recorded as a zero. e. Begins backtracing at the maximum value found anywhere in the matrix. f. Continues the backtrace until the score falls to 0. Developing Pairwise Sequence Alignment Algorithms

Developing Pairwise Sequence Alignment Algorithms
Smith-Waterman (cont. 2) H E A G A W G H E E Put zeros on borders. Assign initial scores based on a scoring matrix. Calculate new scores based on adjacent cell scores. If sum is less than zero or equal to zero begin new scoring with next cell. P A W H E This example uses the BLOSUM45 Scoring Matrix with a gap penalty of -8. Developing Pairwise Sequence Alignment Algorithms

Developing Pairwise Sequence Alignment Algorithms
Smith-Waterman (cont. 3) H E A G A W G H E E P A W H E Begin backtrace at the maximum value found anywhere on the matrix. Continue the backtrace until score falls to zero AWGHE || || AW-HE Path Score=28 Developing Pairwise Sequence Alignment Algorithms

Calculation of similarity score and percent similarity
A W G H E A W - H E Blosum45 SCORES GAP PENALTY (novel) % SIMILARITY = NUMBER OF POS. SCORES DIVIDED BY NUMBER OF AAs IN REGION x 100 Similarity Score= 28 % SIMILARITY = 4/5 x 100 = 80% Developing Pairwise Sequence Alignment Algorithms

Extend LCS to Local Alignment
0 (no negative scores) si-1,j + (vi, -) si,j = max { si,j-1 + (-, wj) si-1,j-1 + (vi, wj) (vi, -) = (-, wj) = - = fixed gap penalty (vi, wj) = score for match or mismatch – can be fixed, from PAM or BLOSUM Developing Pairwise Sequence Alignment Algorithms

Historical Perspective: Smith-Waterman (1 of 3)
Algorithm The two molecular sequences will be A=a1a an, and B=b1b bm. A similarity s(a,b) is given between sequence elements a and b. Deletions of length k are given weight Wk. To find pairs of segments with high degrees of similarity, we set up a matrix H . First set Hk0 = Hol = 0 for 0 <= k <= n and 0 <= l <= m. Preliminary values of H have the interpretation that H i j is the maximum similarity of two segments ending in ai and bj. respectively. These values are obtained from the relationship Hij=max{Hi-1,j-1 + s(ai,bj), max {Hi-k,j – Wk}, max{Hi,j-l - Wl }, 0} ( 1 ) k >= l >= 1 1 <= i <= n and 1 <= j <= m. Developing Pairwise Sequence Alignment Algorithms

Historical Perspective: Smith-Waterman (2 of 3)
The formula for Hij follows by considering the possibilities for ending the segments at any ai and bj. If ai and bj are associated, the similarity is Hi-l,j-l + s(ai,bj). (2) If ai is at the end of a deletion of length k, the similarity is Hi – k, j - Wk . (3) If bj is at the end of a deletion of length 1, the similarity is Hi,j-l - Wl. (typo in paper) (4) Finally, a zero is included to prevent calculated negative similarity, indicating no similarity up to ai and bj. Developing Pairwise Sequence Alignment Algorithms

Historical Perspective: Smith-Waterman (3 of 3)
The pair of segments with maximum similarity is found by first locating the maximum element of H. The other matrix elements leading to this maximum value are than sequentially determined with a traceback procedure ending with an element of H equal to zero. This procedure identifies the segments as well as produces the corresponding alignment. The pair of segments with the next best similarity is found by applying the traceback procedure to the second largest element of H not associated with the first traceback. Developing Pairwise Sequence Alignment Algorithms

The E value (false positive expectation value)
The Expect value (E) is a parameter that describes the number of “hits” one can "expect" to see just by chance when searching a database of a particular size. It decreases exponentially as the Similarity Score (S) increases (inverse relationship). The higher the Similarity Score, the lower the E value. Essentially, the E value describes the random background noise that exists for matches between two sequences. The E value is used as a convenient way to create a significance threshold for reporting results. When the E value is increased from the default value of 10 prior to a sequence search, a larger list with more low-similarity scoring hits can be reported. An E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size you might expect to see 1 match with a similar score simply by chance. Developing Pairwise Sequence Alignment Algorithms

E value (Karlin-Altschul statistics)
E = K•m•n•e-λS Where K is constant, m is the length of the query sequence, n is the length of the database sequence, λ is the decay constant, S is the similarity score. If S increases, E decreases exponentially. If the decay constant increases, E decreases exponentially If m•n increases the “search space” increases and there is a greater chance for a random “hit”, E increases. Larger database will increase E. However, larger query sequence often decreases E. Why??? Developing Pairwise Sequence Alignment Algorithms

Project Teams and Presentation Assignments
Base Project (Global Alignment): Larry and Darnell Extension 1 (Ends-Free Global Alignment): Steven and Charlie Extension 2 (Local Alignment): Olivera and Natalia Extension 3 (Local Alignment – all): Brittany and Alana Extension 4 (Database): Nathaniel and Anna U. Extension 5 (Space Efficient Algorithm): David and Shilpa Extension 6 (Affine Gap Penalty): Rachel and Anna P. Extension 7 (Hirschberg’s Algorithm): Wendy and Andrew Developing Pairwise Sequence Alignment Algorithms

Developing Pairwise Sequence Alignment Algorithms
Workshop Meet with your group and develop for the overall structure of your program High-level algorithm Identify the modules, functions (including parameters), and global variables Determine who is responsible for each module Devise a development timeline and a testing strategy Developing Pairwise Sequence Alignment Algorithms