Developing Pairwise Sequence Alignment Algorithms


Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez

Outline
  Overview of global and local alignment
  References for sequence alignment algorithms
  Discussion of Needleman-Wunsch iterative approach to global alignment
  Discussion of Smith-Waterman recursive approach to local alignment
  Discussion of how LCS Algorithm can be extended for
    Global alignment (Needleman-Wunsch)
    Local alignment (Smith-Waterman)
  Group assignments for project

Overview of Pairwise Sequence Alignment
  Dynamic Programming
    Applied to optimization problems
    Useful when
      Problem can be recursively divided into sub-problems
      Sub-problems are not independent
  Needleman-Wunsch is a global alignment technique that uses an iterative algorithm and no gap penalty (could extend to fixed gap penalty).
  Smith-Waterman is a local alignment technique that uses a recursive algorithm and can use alternative gap penalties (such as affine).
  Smith-Waterman's algorithm is an extension of the Longest Common Subsequence (LCS) problem and can be generalized to solve both local and global alignment.
  Note: Needleman-Wunsch is usually used to refer to global alignment regardless of the algorithm used.

References
  http://www.sbc.su.se/~arne/kurser/swell/pairwise_alignments.html
  An Introduction to Bioinformatics Algorithms (Computational Molecular Biology), Neil C. Jones, Pavel Pevzner
  Computational Molecular Biology – An Algorithmic Approach, Pavel Pevzner
  Introduction to Computational Biology – Maps, sequences, and genomes, Michael Waterman
  Algorithms on Strings, Trees, and Sequences – Computer Science and Computational Biology, Dan Gusfield

Classic Papers
  Needleman, S.B. and Wunsch, C.D. A General Method Applicable to the Search for Similarities in Amino Acid Sequence of Two Proteins. J. Mol. Biol., 48, pp. 443-453, 1970. (http://www.cs.umd.edu/class/spring2003/cmsc838t/papers/needlemanandwunsch1970.pdf)
  Smith, T.F. and Waterman, M.S. Identification of Common Molecular Subsequences. J. Mol. Biol., 147, pp. 195-197, 1981. (http://www.cmb.usc.edu/papers/msw_papers/msw-042.pdf)

Why search sequence databases?
  I have just sequenced something. What is known about the thing I sequenced?
  I have a unique sequence. Is there similarity to another gene that has a known function?
  I found a new protein sequence in a lower organism. Is it similar to a protein from another species?

Global Alignment Method
  Output: An alignment of two sequences is represented by three lines
    The first line shows the first sequence.
    The third line shows the second sequence.
    The second line has a row of symbols. The symbol is a vertical bar wherever characters in the two sequences match, and a space wherever they do not.
    Dots (or dashes) may be inserted in either sequence to represent gaps.

Global Alignment Method (cont. 1)
  For example, the two hypothetical sequences
    abcdefghajklm
    abbdhijk
  could be aligned like this
    abcdefghajklm
    || |   | ||
    abbd...hijk
  As shown, there are 6 matches, 2 mismatches, and one gap of length 3.

Global Alignment Method (cont. 2)
  The alignment can be scored according to a payoff matrix for fixed scoring. You can also use PAM or BLOSUM.
    $payoff = {match => $match, mismatch => $mismatch, gap_open => $gap_open, gap_extend => $gap_extend};
  For the algorithm to operate correctly, the match score must be positive and the other payoff entries must be negative.

Global Alignment Method (cont. 3)
  Example
  Given the payoff matrix
    $payoff = {match => 4, mismatch => -3, gap_open => -2, gap_extend => -1};
  What is the alignment and what is the alignment score for the following two sequences?
    Sequence 1: abcdefghajklm
    Sequence 2: abbdhijk

Global Alignment Method (cont. 4)
  The sequences
    abcdefghajklm
    abbdhijk
  are aligned and scored like this
    a b c d e f g h a j k l m
    | |   |       |   | |
    a b b d . . . h i j k
    match:      6 x 4    =  24
    mismatch:   2 x (-3) =  -6
    gap_open:   1 x (-2) =  -2
    gap_extend: 3 x (-1) =  -3
  for a total score of 24 - 6 - 2 - 3 = 13.
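
The arithmetic on this slide can be checked with a short Python sketch. The function name `score_alignment` and the dictionary form of the payoff matrix are illustrative choices, not part of the original slides, and the sketch assumes (as the slide's total does) that the unaligned tail "lm" is not scored.

```python
def score_alignment(top, bottom, payoff, gap_chars=".-"):
    """Score two equal-length gapped strings column by column.

    Assumption: a run of consecutive gap columns is charged gap_open once
    plus gap_extend for every column in the run, as on the slide.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    total = 0
    in_gap = False  # True while we are inside a run of gap columns
    for a, b in zip(top, bottom):
        if a in gap_chars or b in gap_chars:
            if not in_gap:
                total += payoff["gap_open"]   # charged once per gap
            total += payoff["gap_extend"]     # charged for every gap column
            in_gap = True
        else:
            total += payoff["match"] if a == b else payoff["mismatch"]
            in_gap = False
    return total


payoff = {"match": 4, "mismatch": -3, "gap_open": -2, "gap_extend": -1}
# Only the aligned columns are passed in; the trailing "lm" is left out.
print(score_alignment("abcdefghajk", "abbd...hijk", payoff))  # 13
```

This only scores a given alignment; it does not search for the best one.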

Global Alignment Method (cont. 5)
  The algorithm should guarantee that no other alignment of these two sequences has a higher score under this payoff matrix.

Three steps in Dynamic Programming
  1. Initialization
  2. Matrix fill or scoring
  3. Traceback and alignment

Ends-Free Global Alignment with Fixed Scoring
  Two sequences will be aligned.
    GGATCGA (sequence #1 = V)
    GAATTCAGTTA (sequence #2 = W)
  A simple fixed scoring scheme will be used
    $\delta(v_i, w_j) = 1$ if the residue at position i of sequence #1 ($v_i$) is the same as the residue at position j of sequence #2 ($w_j$) – called match score
    $\delta(v_i, w_j) = 0$ if $v_i \neq w_j$ – called mismatch score
    $\delta(v_i, -) = \delta(-, w_j) = 0$ – called gap penalty

Matrix fill step:
  Each position $S_{i,j}$ is defined to be the MAXIMUM score at position i,j, where i is the row and j is the column
  $$S_{i,j} = \max \begin{cases} S_{i-1,j-1} + \delta(v_i, w_j) & \text{(match or mismatch in the diagonal)} \\ S_{i,j-1} + w & \text{(gap in sequence \#1 = V)} \\ S_{i-1,j} + w & \text{(gap in sequence \#2 = W)} \end{cases}$$
  Here $\delta(v_i, w_j)$ is the match or mismatch score and w is the gap penalty.

Initialization step:
  1) Create Matrix with m + 1 columns and n + 1 rows, where n = number of letters in sequence 1 and m = number of letters in sequence 2.
  2) First column and first row will be filled with 0's (for ends-free alignment).

Scoring Step:
  Fill in the matrix row by row.

Traceback Step:
  Seq#1  G G A - T C - G - - A
         |   |   | |   |     |
  Seq#2  G A A T T C A G T T A
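
The initialization and matrix fill steps of the ends-free example can be sketched in Python as follows. The function name `ends_free_score` and its default arguments are assumptions made for illustration. Traceback is left out, because with a match score of 1 and mismatch and gap scores of 0 several alignments tie for the best score.

```python
def ends_free_score(v, w, match=1, mismatch=0, gap=0):
    """Fill the scoring matrix for an ends-free global alignment.

    Rows follow sequence #1 (v), columns follow sequence #2 (w).
    Returns the best score and the filled matrix.
    """
    n, m = len(v), len(w)
    # Initialization: (n + 1) rows by (m + 1) columns, zeros on the borders.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    # Matrix fill, row by row.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = S[i - 1][j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            left = S[i][j - 1] + gap   # gap in sequence #1
            up = S[i - 1][j] + gap     # gap in sequence #2
            S[i][j] = max(diag, left, up)
    # Ends-free: the best score is the maximum of the last row and last column.
    best = max(max(S[n]), max(row[m] for row in S))
    return best, S


best, S = ends_free_score("GGATCGA", "GAATTCAGTTA")
print(best)  # 6, the number of matched columns in the traceback slide
```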

Global Alignment output file
  Global: HBA_HUMAN vs HBB_HUMAN
  Score: 290.50

  HBA_HUMAN   1 VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFP 44
                |:| :|: | | |||| : | | ||| |: : :| |: :|
  HBB_HUMAN   1 VHLTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFE 43

  HBA_HUMAN  45 HF.DLS.....HGSAQVKGHGKKVADALTNAVAHVDDMPNALSAL 83
                | ||| |: :|| ||||| | :: :||:|:: : |
  HBB_HUMAN  44 SFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATL 88

  HBA_HUMAN  84 SDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKF 128
                |:|| || ||| ||:|| : |: || | |||| | |: |
  HBB_HUMAN  89 SELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKV 133

  HBA_HUMAN 129 LASVSTVLTSKYR 141
                :| |: | ||
  HBB_HUMAN 134 VAGVANALAHKYH 146

  %id = 45.32 %similarity = 63.31 (88/139 *100)
  Overall %id = 43.15; Overall %similarity = 60.27 (88/146 *100)

LCS Problem (review)
  Similarity score
  $$s_{i,j} = \max \begin{cases} s_{i-1,j} \\ s_{i,j-1} \\ s_{i-1,j-1} + 1, & \text{if } v_i = w_j \end{cases}$$
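
As a reminder of how this recurrence turns into code, here is a minimal Python sketch that fills the similarity matrix and traces back one longest common subsequence. The function name `lcs` is an illustrative choice.

```python
def lcs(v, w):
    """Return (length, subsequence) of one longest common subsequence."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j], s[i][j - 1])
            if v[i - 1] == w[j - 1]:
                s[i][j] = max(s[i][j], s[i - 1][j - 1] + 1)
    # Traceback from the lower right corner.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if v[i - 1] == w[j - 1]:
            out.append(v[i - 1])   # matching characters: take the diagonal
            i -= 1
            j -= 1
        elif s[i - 1][j] >= s[i][j - 1]:
            i -= 1                 # skip a character of v
        else:
            j -= 1                 # skip a character of w
    return s[n][m], "".join(reversed(out))


print(lcs("GGATCGA", "GAATTCAGTTA"))  # (6, 'GATCGA')
```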

Extend LCS to Global Alignment
  $$s_{i,j} = \max \begin{cases} s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \\ s_{i-1,j-1} + \delta(v_i, w_j) \end{cases}$$
  $\delta(v_i, -) = \delta(-, w_j) = -\sigma$ = fixed gap penalty
  $\delta(v_i, w_j)$ = score for match or mismatch – can be fixed, from PAM or BLOSUM

Global Alignment Alternatives
  Ends-free alignment – don't penalize gaps at the beginning or end
    Initialize first row and column of S to 0
    Search last row and column for maximum score
  Regular global alignment – score end to end (penalize gaps at beginning and end)
    Initialize first row and column of S with cumulative gap penalties (position i gets i times the gap penalty)
    Alignment score is in the lower right corner of S
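
The two alternatives differ only in how S is initialized and where the final score is read. A minimal Python sketch, assuming a fixed gap penalty and using the hypothetical name `global_alignment_score` with an `ends_free` flag:

```python
def global_alignment_score(v, w, match=1, mismatch=-1, gap=-2, ends_free=False):
    """Score a global alignment with a fixed gap penalty.

    ends_free=False: regular global alignment (end gaps are penalized).
    ends_free=True:  gaps at the beginning or end are free.
    The scoring values are illustrative defaults, not values from the slides.
    """
    n, m = len(v), len(w)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    if not ends_free:
        # Regular global alignment: cumulative gap penalties on the borders.
        for i in range(1, n + 1):
            S[i][0] = i * gap
        for j in range(1, m + 1):
            S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = S[i - 1][j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            S[i][j] = max(diag, S[i - 1][j] + gap, S[i][j - 1] + gap)
    if ends_free:
        # Search the last row and the last column for the maximum score.
        return max(max(S[n]), max(row[m] for row in S))
    return S[n][m]  # lower right corner


print(global_alignment_score("ACGT", "TTACGTTT"))                  # -4
print(global_alignment_score("ACGT", "TTACGTTT", ends_free=True))  # 4
```

In the regular case the four unavoidable gap columns cost 4 x (-2), which cancels the four matches and more; in the ends-free case those end gaps cost nothing.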

Historical Perspective: Needleman-Wunsch (1 of 3)
  Match = 1
  Mismatch = 0
  Gap = 0

Historical Perspective: Needleman-Wunsch (2 of 3)

Historical Perspective: Needleman-Wunsch (3 of 3)
  From page 446:
  It is apparent that the above array operation can begin at any of a number of points along the borders of the array, which is equivalent to a comparison of N-terminal residues or C-terminal residues only. As long as the appropriate rules for pathways are followed, the maximum match will be the same. The cells of the array which contributed to the maximum match, may be determined by recording the origin of the number that was added to each cell when the array was operated upon.

Smith-Waterman Algorithm
  Advances in Applied Mathematics, 2:482-489 (1981)
  The Smith-Waterman algorithm is a local alignment tool used to obtain sensitive pairwise similarity alignments.
  Smith-Waterman algorithm uses dynamic programming. It selects the optimal path as the highest ranked alignment.
  Smith-Waterman algorithm is useful for finding local areas of similarity between sequences that are too dissimilar for global alignment.
  The S-W algorithm uses a lot of computer memory.
  BLAST and FASTA are other search algorithms that use some aspects of S-W.

Smith-Waterman (cont. 1)
  a. It searches for sequence matches.
  b. Assigns a score to each pair of amino acids
    - uses similarity scores
    - uses positive scores for related residues
    - uses negative scores for substitutions and gaps
  c. Initializes edges of the matrix with zeros
  d. As the scores are summed in the matrix, any sum below 0 is recorded as a zero.
  e. Begins backtracing at the maximum value found anywhere in the matrix.
  f. Continues the backtrace until the score falls to 0.

Smith-Waterman (cont. 2)
  Put zeros on borders. Assign initial scores based on a scoring matrix. Calculate new scores based on adjacent cell scores. If the sum is less than or equal to zero, begin new scoring with the next cell.

          H   E   A   G   A   W   G   H   E   E
  P   0   0   0   0   0   0   0   0   0   0   0
  A   0   0   0   5   0   5   0   0   0   0   0
  W   0   0   0   0   3   0  20  12   4   0   0
  H   0  10   2   0   0   1  12  18  22  14   6
  E   0   2  16   8   0   0   4  10  18  28  20
  A   0   0   8  21  13   5   0   4  10  20  27
  E   0   0   6  13  19  12   4   0   4  16  26

  This example uses the BLOSUM45 Scoring Matrix with a gap penalty of -8.

Smith-Waterman (cont. 3)

          H   E   A   G   A   W   G   H   E   E
  P   0   0   0   0   0   0   0   0   0   0   0
  A   0   0   0   5   0   5   0   0   0   0   0
  W   0   0   0   0   3   0  20  12   4   0   0
  H   0  10   2   0   0   1  12  18  22  14   6
  E   0   2  16   8   0   0   4  10  18  28  20
  A   0   0   8  21  13   5   0   4  10  20  27
  E   0   0   6  13  19  12   4   0   4  16  26

  Begin backtrace at the maximum value found anywhere on the matrix. Continue the backtrace until the score falls to zero.
    AWGHE
    || ||
    AW-HE
  Path Score = 28

Calculation of similarity score and percent similarity
    A   W   G   H   E
    A   W   -   H   E
    5  15  -8  10   6
  The values 5, 15, 10 and 6 are Blosum45 SCORES; -8 is the GAP PENALTY (novel).
  % SIMILARITY = NUMBER OF POS. SCORES DIVIDED BY NUMBER OF AAs IN REGION x 100
  Similarity Score = 28
  % SIMILARITY = 4/5 x 100 = 80%
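
The same calculation as a small Python sketch. The function name `similarity_summary` is an illustrative choice, and the per-column scores are the ones listed on the slide.

```python
def similarity_summary(column_scores):
    """Return (similarity score, percent similarity) for an aligned region."""
    score = sum(column_scores)
    positives = sum(1 for s in column_scores if s > 0)  # columns with a positive score
    percent = 100 * positives / len(column_scores)
    return score, percent


# A/A, W/W, G/gap, H/H, E/E
print(similarity_summary([5, 15, -8, 10, 6]))  # (28, 80.0)
```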

Extend LCS to Local Alignment
  $$s_{i,j} = \max \begin{cases} 0 & \text{(no negative scores)} \\ s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \\ s_{i-1,j-1} + \delta(v_i, w_j) \end{cases}$$
  $\delta(v_i, -) = \delta(-, w_j) = -\sigma$ = fixed gap penalty
  $\delta(v_i, w_j)$ = score for match or mismatch – can be fixed, from PAM or BLOSUM
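
A minimal Python sketch of this local alignment recurrence with fixed scoring and a fixed gap penalty, including the traceback from the maximum cell down to a zero. The function name `smith_waterman`, the scoring values and the test sequences are assumptions chosen for illustration; a real project would plug in PAM or BLOSUM scores.

```python
def smith_waterman(v, w, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned v segment, aligned w segment)."""
    n, m = len(v), len(w)
    S = [[0] * (m + 1) for _ in range(n + 1)]  # zeros on the borders
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = S[i - 1][j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
            # The extra 0 keeps any cell from going negative.
            S[i][j] = max(0, diag, S[i - 1][j] + gap, S[i][j - 1] + gap)
            if S[i][j] > best:
                best, best_pos = S[i][j], (i, j)
    # Traceback: start at the maximum anywhere in the matrix, stop at a zero.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and S[i][j] > 0:
        diag = S[i - 1][j - 1] + (match if v[i - 1] == w[j - 1] else mismatch)
        if S[i][j] == diag:
            top.append(v[i - 1])
            bottom.append(w[j - 1])
            i -= 1
            j -= 1
        elif S[i][j] == S[i - 1][j] + gap:
            top.append(v[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(w[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("TTTACGTAAA", "GGGACGTCCC"))  # (8, 'ACGT', 'ACGT')
```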

Historical Perspective: Smith-Waterman (1 of 3)
  Algorithm
  The two molecular sequences will be $A = a_1 a_2 \ldots a_n$ and $B = b_1 b_2 \ldots b_m$. A similarity $s(a,b)$ is given between sequence elements a and b. Deletions of length k are given weight $W_k$. To find pairs of segments with high degrees of similarity, we set up a matrix H. First set
  $$H_{k0} = H_{0l} = 0 \quad \text{for } 0 \le k \le n \text{ and } 0 \le l \le m.$$
  Preliminary values of H have the interpretation that $H_{ij}$ is the maximum similarity of two segments ending in $a_i$ and $b_j$, respectively. These values are obtained from the relationship
  $$H_{ij} = \max\Big\{ H_{i-1,j-1} + s(a_i, b_j),\ \max_{k \ge 1}\{H_{i-k,j} - W_k\},\ \max_{l \ge 1}\{H_{i,j-l} - W_l\},\ 0 \Big\} \qquad (1)$$
  for $1 \le i \le n$ and $1 \le j \le m$.

Historical Perspective: Smith-Waterman (2 of 3)
  The formula for $H_{ij}$ follows by considering the possibilities for ending the segments at any $a_i$ and $b_j$.
  (1) If $a_i$ and $b_j$ are associated, the similarity is $H_{i-1,j-1} + s(a_i, b_j)$.
  (2) If $a_i$ is at the end of a deletion of length k, the similarity is $H_{i-k,j} - W_k$.
  (3) If $b_j$ is at the end of a deletion of length l, the similarity is $H_{i,j-l} - W_l$. (typo in paper)
  (4) Finally, a zero is included to prevent calculated negative similarity, indicating no similarity up to $a_i$ and $b_j$.
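
Relationship (1) with length-dependent deletion weights can be sketched directly in Python. The function name `sw_general_gap`, the scoring function `s`, the weight function `W` and the test sequences are assumptions for illustration. This direct form tries every deletion length in each cell, so it is slower than the fixed-gap version.

```python
def sw_general_gap(a, b, s, W):
    """Fill H using relationship (1); return (maximum element, H).

    s(x, y) is the similarity of two sequence elements.
    W(k) is the weight of a deletion of length k.
    """
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]  # H[k][0] = H[0][l] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            assoc = H[i - 1][j - 1] + s(a[i - 1], b[j - 1])      # case (1)
            del_a = max(H[i - k][j] - W(k) for k in range(1, i + 1))  # case (2)
            del_b = max(H[i][j - l] - W(l) for l in range(1, j + 1))  # case (3)
            H[i][j] = max(assoc, del_a, del_b, 0)                # case (4)
    return max(max(row) for row in H), H


def s(x, y):
    return 3 if x == y else -1   # illustrative similarity values


def W(k):
    return 2 + k                 # illustrative weights: a long gap costs less than two short ones


best, H = sw_general_gap("ACGTACGT", "ACGTGGACGT", s, W)
print(best)  # 20: eight matches (24) minus one deletion of length 2 (4)
```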

Historical Perspective: Smith-Waterman (3 of 3)
  The pair of segments with maximum similarity is found by first locating the maximum element of H. The other matrix elements leading to this maximum value are then sequentially determined with a traceback procedure ending with an element of H equal to zero. This procedure identifies the segments as well as produces the corresponding alignment. The pair of segments with the next best similarity is found by applying the traceback procedure to the second largest element of H not associated with the first traceback.

The E value (false positive expectation value)
  The Expect value (E) is a parameter that describes the number of "hits" one can "expect" to see just by chance when searching a database of a particular size. It decreases exponentially as the Similarity Score (S) increases (inverse relationship). The higher the Similarity Score, the lower the E value. Essentially, the E value describes the random background noise that exists for matches between two sequences.
  The E value is used as a convenient way to create a significance threshold for reporting results. When the E value is increased from the default value of 10 prior to a sequence search, a larger list with more low-similarity scoring hits can be reported. An E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size you might expect to see 1 match with a similar score simply by chance.

E value (Karlin-Altschul statistics)
  $$E = K \cdot m \cdot n \cdot e^{-\lambda S}$$
  Where K is a constant, m is the length of the query sequence, n is the length of the database sequence, $\lambda$ is the decay constant, and S is the similarity score.
  If S increases, E decreases exponentially.
  If the decay constant increases, E decreases exponentially.
  If m•n increases, the "search space" increases and there is a greater chance for a random "hit", so E increases.
  Larger database will increase E. However, larger query sequence often decreases E. Why???
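
To get a feel for how the formula behaves, here is a small Python sketch. The function name `expect_value` and the values of K and λ are placeholders picked for illustration only; real values depend on the scoring matrix and gap penalties in use.

```python
import math


def expect_value(S, m, n, K, lam):
    """E = K * m * n * exp(-lam * S)."""
    return K * m * n * math.exp(-lam * S)


K, lam = 0.1, 0.3        # hypothetical constants, not taken from any real search
m, n = 250, 1_000_000    # query length and database length (illustrative)

for S in (30, 40, 50):
    # Each extra 10 points of score multiplies E by exp(-lam * 10).
    print(S, expect_value(S, m, n, K, lam))

# Doubling the database length doubles E for the same score.
print(expect_value(40, m, 2 * n, K, lam) / expect_value(40, m, n, K, lam))
```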

Project Teams and Presentation Assignments
  Base Project (Global Alignment): Larry and Darnell
  Extension 1 (Ends-Free Global Alignment): Steven and Charlie
  Extension 2 (Local Alignment): Olivera and Natalia
  Extension 3 (Local Alignment – all): Brittany and Alana
  Extension 4 (Database): Nathaniel and Anna U.
  Extension 5 (Space Efficient Algorithm): David and Shilpa
  Extension 6 (Affine Gap Penalty): Rachel and Anna P.
  Extension 7 (Hirschberg's Algorithm): Wendy and Andrew

Workshop
  Meet with your group and develop the overall structure of your program
    High-level algorithm
    Identify the modules, functions (including parameters), and global variables
    Determine who is responsible for each module
    Devise a development timeline and a testing strategy