6/11/2015 © Bud Mishra, 2001 L7-1 Lecture #7: Local Alignment Computational Biology Lecture #7: Local Alignment Bud Mishra Professor of Computer Science.

Presentation transcript:

6/11/2015 © Bud Mishra, 2001 L7-1 Computational Biology Lecture #7: Local Alignment. Bud Mishra, Professor of Computer Science and Mathematics. 10/22/2002

Local Alignment Problem (LAP)
Finding substrings of high similarity: two given strings $S_1$ and $S_2$ may have regions that are locally highly similar.

LAP: Local Alignment Problem
Given: Two strings $S_1$ and $S_2$.
Find: Substrings $\alpha \sqsubseteq S_1$ and $\beta \sqsubseteq S_2$ (where $\sqsubseteq$ means "is a substring of") whose similarity, in terms of an objective function (e.g., the optimal global alignment value), is maximum over all pairs of substrings from $S_1$ and $S_2$:
$v^* = \max_{\alpha \sqsubseteq S_1,\ \beta \sqsubseteq S_2} \mathrm{distance}(\alpha, \beta)$

Example
Scores: $d(x,x) = 2$, $d(x,y) = -2$, $d(x,-) = d(-,x) = -1$.
$S_1$ = p q r a x a b c s t v q, with $\alpha$ = a x a b c s $\sqsubseteq S_1$
$S_2$ = x y a x b a c s l l, with $\beta$ = a x b a c s $\sqsubseteq S_2$
Local alignment:
a x a b - c s
| |   |   | |
a x - b a c s
$\mathrm{distance}(\alpha, \beta) = 8$ (five matches at 2 each, two spaces at -1 each).
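The value in the example can be checked by summing the column scores of the alignment. The following is a minimal sketch in Python; the function names `d` and `alignment_value` are illustrative choices, not names from the lecture.

```python
def d(x, y):
    """Column score from the example: match 2, mismatch -2, space -1."""
    if x == "-" or y == "-":
        return -1
    return 2 if x == y else -2


def alignment_value(row1, row2):
    """Sum the column scores of two aligned rows of equal length."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    return sum(d(x, y) for x, y in zip(row1, row2))


# The local alignment from the example slide.
print(alignment_value("axab-cs", "ax-bacs"))  # prints 8
```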

Naïve Complexity
Note:
(1) Let $|S_1| = n$ and $|S_2| = m$.
– Total number of substrings of $S_1$ = $\binom{n+1}{2} = O(n^2)$
– Total number of substrings of $S_2$ = $\binom{m+1}{2} = O(m^2)$
– Naïvely, $O(n^2 m^2)$ candidate pairs of substrings need to be globally aligned, each by a DP algorithm of complexity $O(|\alpha|\,|\beta|)$.
Complexity of the resulting algorithm = $O(n^3 m^3)$.
(2) An improved algorithm (SWAT, Smith-Waterman) reduces the time complexity to $O(nm)$.

LSAP: Local Suffix Alignment Problem
A restricted version of the LAP.
Given: Two strings $S_1$ and $S_2$ and two indices $i \le |S_1|$ and $j \le |S_2|$.
– $A_i = S_1[1..i]$, a prefix of $S_1$
– $B_j = S_2[1..j]$, a prefix of $S_2$
Find: A suffix $\alpha = S_1[k..i]$ of $A_i$ (possibly empty) and a suffix $\beta = S_2[l..j]$ of $B_j$ (possibly empty) that maximize a linear objective function $V(\alpha, \beta)$ over all pairs of suffixes of $A_i$ and $B_j$.

Objective Function
$v(i,j) = \max_{\alpha = \mathrm{suf}\ S_1[1..i],\ \beta = \mathrm{suf}\ S_2[1..j]} V(\alpha, \beta)$ = value of the optimal local suffix alignment for the given index pair $i, j$.
$v^* = \max_{i \le n,\ j \le m} v(i,j)$ = value of the optimal local alignment, where $n = |S_1|$ and $m = |S_2|$.

Optimal Local Alignment: Recurrence Equations
$v^* = \max_{i \le n,\ j \le m} v(i,j)$. The maximum is attained at some index pair $(i', j')$ by a suffix $\alpha$ of $S_1[1..i']$ and a suffix $\beta$ of $S_2[1..j']$, so that $v^* = v(i', j') = V(\alpha, \beta)$.
Consider an optimal suffix alignment with $\alpha = \mathrm{suf}\ S_1[1..i]$ and $\beta = \mathrm{suf}\ S_2[1..j]$.
Case 1: $\alpha = \beta$ = the empty string.
– Base: $V(\alpha, \beta) = 0$

Optimal Local Alignment: Recurrence Equations
Case 2: $\alpha$ is nonempty, $\alpha = \alpha' \cdot S_1[i]$, and $S_1[i]$ matches "-":
– Ind(A): $V(\alpha, \beta) = V(\alpha', \beta) + d(S_1[i], -)$
...or $S_1[i]$ matches $S_2[j]$ (with $\beta = \beta' \cdot S_2[j]$):
– Ind(C): $V(\alpha, \beta) = V(\alpha', \beta') + d(S_1[i], S_2[j])$

Optimal Local Alignment: Recurrence Equations
Case 3: $\beta$ is nonempty, $\beta = \beta' \cdot S_2[j]$, and $S_2[j]$ matches "-":
– Ind(B): $V(\alpha, \beta) = V(\alpha, \beta') + d(-, S_2[j])$
...or $S_1[i]$ matches $S_2[j]$ (with $\alpha = \alpha' \cdot S_1[i]$):
– Ind(C): $V(\alpha, \beta) = V(\alpha', \beta') + d(S_1[i], S_2[j])$

Recurrence Equation
$v(i,j) = \max_{\alpha = \mathrm{suf}\ S_1[1..i],\ \beta = \mathrm{suf}\ S_2[1..j]} V(\alpha, \beta)$
Base ($i = 0 \lor j = 0$): $v(i,j) = 0$, i.e., $v(0,0) = v(i,0) = v(0,j) = 0$.
Induction ($i > 0 \land j > 0$):
$v(i,j) = \max[\,0,\ v(i-1,j) + d(S_1[i], -),\ v(i,j-1) + d(-, S_2[j]),\ v(i-1,j-1) + d(S_1[i], S_2[j])\,]$

Dynamic Programming Table (with Traceback)
– Compute all $v(i,j)$ entries: complexity = $O(nm)$.
– Find $v^* = v(i^*, j^*)$ by finding the largest value in any cell: complexity = $O(nm)$.
– Trace the pointers back from $v(i^*, j^*)$ until a cell $(i', j')$ with value $v(i', j') = 0$ is reached: complexity = $O(n+m)$.
Results: $\alpha = S_1[i'+1..i^*] \sqsubseteq S_1$ and $\beta = S_2[j'+1..j^*] \sqsubseteq S_2$.
Total complexity = $O(nm) = O(|S_1|\,|S_2|)$.
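The three steps above (fill the table, locate the largest cell, trace back to a zero cell) can be sketched as follows. This is an illustrative Python sketch under the scoring of the earlier example (match 2, mismatch -2, space -1); the names `d` and `local_align` are hypothetical, and instead of storing pointers the traceback re-derives each move by checking which term of the recurrence produced the cell value.

```python
def d(x, y):
    """Scoring from the earlier example: match 2, mismatch -2, space -1."""
    if x == "-" or y == "-":
        return -1
    return 2 if x == y else -2


def local_align(s1, s2, score=d):
    """Return (v*, row1, row2) for an optimal local alignment of s1 and s2."""
    n, m = len(s1), len(s2)
    # v[i][j] uses 1-based string positions; row 0 and column 0 stay 0.
    v = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            v[i][j] = max(
                0,
                v[i - 1][j] + score(s1[i - 1], "-"),
                v[i][j - 1] + score("-", s2[j - 1]),
                v[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]),
            )
            if v[i][j] > best:
                best, bi, bj = v[i][j], i, j
    # Traceback from the largest cell until a cell with value 0 is reached.
    i, j = bi, bj
    row1, row2 = [], []
    while v[i][j] > 0:
        if v[i][j] == v[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]):
            row1.append(s1[i - 1]); row2.append(s2[j - 1])
            i, j = i - 1, j - 1
        elif v[i][j] == v[i - 1][j] + score(s1[i - 1], "-"):
            row1.append(s1[i - 1]); row2.append("-")
            i -= 1
        else:
            row1.append("-"); row2.append(s2[j - 1])
            j -= 1
    return best, "".join(reversed(row1)), "".join(reversed(row2))


value, top, bottom = local_align("pqraxabcstvq", "xyaxbacsll")
print(value)   # prints 8
print(top)
print(bottom)
```

Several alignments can share the optimal value, so the rows printed depend on the order in which the traceback tests the three moves; only the value is unique.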

Example
Dynamic programming table with traceback pointers for the two strings of the earlier example: the characters of $S_1$ = p q r a x a b c s t v q label the rows and the characters of $S_2$ = x y a x b a c s l l label the columns. (Table entries and pointer arrows are not recoverable from the transcript.)

Dealing with Gaps
A gap is any "maximal consecutive run of spaces" in a single string of a given alignment.
c t t t a a c - - a - a c
c - - - c a c c c a t - c
This alignment contains four gaps, $g_1$, $g_2$, $g_3$ and $g_4$.

Gaps
Initial gap
– A gap may be bordered on the right by the first character of a string.
Final gap
– A gap may be bordered on the left by the last character of a string.
Internal gap
– A gap may be bordered on both left and right.
Simple gap penalty model: constant weight $W_g$
– Each gap contributes a constant penalty = $W_g$.
– $d(x,x) = 2$, $d(x,y) = -2$, $d(x,-) = d(-,y) = 0$.
– Let the number of gaps be $k$. Then the value of an alignment = $\sum_{i=1}^{l} d(S'_1[i], S'_2[i]) - k\,W_g$, where $S'_1$ and $S'_2$ are the two rows of the alignment (the strings with spaces inserted) and $l$ is their common length.

Biological Motivations for Gap Models
– Unequal crossing-over in meiosis
– DNA slippage during replication
– Insertion of transposable elements ("jumping genes")
– Insertion by retroviruses
– Translocation between chromosomes
Examples of alignment with gaps:
– cDNA matching problem
– Processed pseudogene problem

Gap Weights
Constant:
– Each gap has a penalty of $W_g$.
– Each space is free: $d(x,-) = d(-,x) = 0$.
Affine:
– Gap initiation weight = $W_g$
– Gap extension weight = $W_s$
– Each gap of length $q$ has a penalty of $W_g + q\,W_s$.
Convex:
– Each gap of length $q$ has a penalty of $W_g + (\ln q)\,W_s$.
Arbitrary:
– Each gap of length $q$ has a penalty of $W_g + \omega(q)\,W_s$, where $\omega(q)$ is an arbitrary function.

General Model
Arbitrary:
– Each gap of length $q$ has a penalty of $W_g + \omega(q)\,W_s$, where $\omega(q)$ is an arbitrary function.
– $\omega(q) = 0 \Rightarrow$ constant
– $\omega(q) = q \Rightarrow$ linear/affine
– $\omega(q) = \ln q \Rightarrow$ convex
Total cost under the constant model: $\sum_{i=1}^{l} d(S'_1[i], S'_2[i]) - (\#\text{gaps})\,W_g$
Total cost under the affine model: $\sum_{i=1}^{l} d(S'_1[i], S'_2[i]) - (\#\text{gaps})\,W_g - (\#\text{spaces})\,W_s$
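The two total-cost formulas can be evaluated directly on a given alignment by counting gaps (maximal runs of spaces in one row) and spaces. The sketch below is illustrative only: the function names and the numeric weights in the usage lines ($W_g = 3$, $W_s = 1$) are assumptions chosen for the example, while the match and mismatch scores (2 and -2, spaces free) follow the gap-model slides.

```python
import re


def substitution_total(row1, row2, match=2, mismatch=-2):
    """Sum d over all columns; columns containing a space contribute 0."""
    total = 0
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            continue  # spaces are free: d(x,-) = d(-,x) = 0
        total += match if x == y else mismatch
    return total


def gap_lengths(row1, row2):
    """Lengths of all gaps, i.e. maximal runs of '-' within a single row."""
    return [len(run) for row in (row1, row2) for run in re.findall(r"-+", row)]


def value_constant(row1, row2, wg):
    """Constant model: every gap costs wg regardless of its length."""
    return substitution_total(row1, row2) - len(gap_lengths(row1, row2)) * wg


def value_affine(row1, row2, wg, ws):
    """Affine model: every gap costs wg, plus ws for each space in it."""
    gaps = gap_lengths(row1, row2)
    return substitution_total(row1, row2) - len(gaps) * wg - sum(gaps) * ws


# One gap of length 2 and four matches (assumed weights Wg = 3, Ws = 1).
print(value_constant("ac--gt", "acttgt", wg=3))      # prints 5
print(value_affine("ac--gt", "acttgt", wg=3, ws=1))  # prints 3
```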

Local Alignment under Arbitrary Gap Weight Model
Dynamic programming (Needleman & Wunsch): given two strings $S_1$ and $S_2$, start by aligning the prefixes
– $S_{1,i} = S_1[1..i]$ and
– $S_{2,j} = S_2[1..j]$.
There are three different cases to consider...

Case 1
$S_1[i]$ is aligned to a character strictly to the left of the character $S_2[j]$, so the alignment ends with $S_2[j]$ opposite a space.
(Figure: the prefixes $S_{1,i}$ and $S_{2,j}$, with $S_1[i]$ placed to the left of $S_2[j]$.)

Case 2
$S_1[i]$ is aligned to a character strictly to the right of the character $S_2[j]$, so the alignment ends with $S_1[i]$ opposite a space.
(Figure: the prefixes $S_{1,i}$ and $S_{2,j}$, with $S_1[i]$ placed to the right of $S_2[j]$.)

Case 3
$S_1[i]$ and $S_2[j]$ are aligned opposite each other:
– Subcase A: $S_1[i] = S_2[j]$
– Subcase B: $S_1[i] \ne S_2[j]$
(Figure: the prefixes $S_{1,i}$ and $S_{2,j}$, with $S_1[i]$ placed opposite $S_2[j]$.)

Auxiliary Variables
$X_L(i,j) = \max_{\text{alignments in case 1}} \mathrm{distance}(S_1[1..i], S_2[1..j])$
$X_R(i,j) = \max_{\text{alignments in case 2}} \mathrm{distance}(S_1[1..i], S_2[1..j])$
$X_S(i,j) = \max_{\text{alignments in case 3}} \mathrm{distance}(S_1[1..i], S_2[1..j])$
$V(i,j) = \max(X_L(i,j), X_R(i,j), X_S(i,j))$

Recurrence: Base
Notation: "?" means undefined; $w(q)$ denotes the total penalty of a gap of length $q$ under the general model.
$X_S(0,0) = 0$, $X_S(i,0) = ?$, $X_S(0,j) = ?$
$X_L(0,0) = ?$, $X_L(i,0) = -w(i)$, $X_L(0,j) = ?$
$X_R(0,0) = ?$, $X_R(i,0) = ?$, $X_R(0,j) = -w(j)$
$V(0,0) = 0$, $V(i,0) = -w(i)$, $V(0,j) = -w(j)$

Recurrence: Induction
For $i > 0$ and $j > 0$, with $w(q)$ the total penalty of a gap of length $q$:
$X_S(i,j) = V(i-1,j-1) + d(S_1[i], S_2[j])$
$X_L(i,j) = \max_{0 \le k \le j-1} (V(i,k) - w(j-k))$
$X_R(i,j) = \max_{0 \le l \le i-1} (V(l,j) - w(i-l))$
$V(i,j) = \max(X_L(i,j), X_R(i,j), X_S(i,j))$
Each $V(i,j)$ can be computed in time $O(i+j)$.

Total Time Complexity
Let $|S_1| = n$ and $|S_2| = m$. The recurrence can be evaluated with a dynamic programming table of space complexity $O(nm)$ and time complexity $O(n^2 m + m^2 n)$.
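The cubic running time comes from the inner maxima over $k$ and $l$, which scan a whole row or column prefix for every cell. A minimal Python sketch of the recurrence follows, assuming the gap penalty is supplied as a function `w(q)` giving the total penalty of a gap of length `q`; the function name `global_value_arbitrary` and the sample weights are illustrative, and the sketch returns only the value $V(n,m)$ of the prefix (global) recurrence as stated on the slides, without traceback.

```python
import math


def global_value_arbitrary(s1, s2, w, d):
    """Evaluate V(n, m) with an arbitrary gap penalty function w(q)."""
    n, m = len(s1), len(s2)
    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = -w(i)          # base: one gap of length i
    for j in range(1, m + 1):
        V[0][j] = -w(j)          # base: one gap of length j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            xs = V[i - 1][j - 1] + d(s1[i - 1], s2[j - 1])
            # Last gap covers S2[k+1..j]: scan all k < j, O(j) work.
            xl = max(V[i][k] - w(j - k) for k in range(j))
            # Last gap covers S1[l+1..i]: scan all l < i, O(i) work.
            xr = max(V[l][j] - w(i - l) for l in range(i))
            V[i][j] = max(xs, xl, xr)
    return V[n][m]


def d(x, y):
    """Match 2, mismatch -2 (spaces are charged through w, not d)."""
    return 2 if x == y else -2


# Assumed weights for illustration: Wg = 3, Ws = 1.
print(global_value_arbitrary("acgt", "acttgt", lambda q: 3 + q, d))
print(global_value_arbitrary("acgt", "acttgt", lambda q: 3 + math.log(q), d))
```

With the affine penalty the best alignment is four matches and one gap of length 2, giving 8 - 5 = 3; with the logarithmic penalty the same alignment costs 3 + ln 2 for the gap.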

Affine Gap Model: Recurrence
SWAT: Smith-Waterman
Modifying the recurrence equations for the affine case (base conditions):
– $X_S(0,0) = 0$, $X_S(i,0) = ?$, $X_S(0,j) = ?$
– $X_L(0,0) = ?$, $X_L(i,0) = -W_g - i\,W_s$, $X_L(0,j) = ?$
– $X_R(0,0) = ?$, $X_R(i,0) = ?$, $X_R(0,j) = -W_g - j\,W_s$
– $V(0,0) = 0$, $V(i,0) = -W_g - i\,W_s$, $V(0,j) = -W_g - j\,W_s$

Recurrence: Induction
For $i > 0$ and $j > 0$:
$X_S(i,j) = V(i-1,j-1) + d(S_1[i], S_2[j])$
$X_L(i,j) = \max(X_L(i,j-1) - W_s,\ ?,\ X_S(i,j-1) - W_g - W_s,\ V(i,j-1) - W_g - W_s) = \max[X_L(i,j-1),\ V(i,j-1) - W_g] - W_s$
$X_R(i,j) = \max(?,\ X_R(i-1,j) - W_s,\ X_S(i-1,j) - W_g - W_s,\ V(i-1,j) - W_g - W_s) = \max[X_R(i-1,j),\ V(i-1,j) - W_g] - W_s$
$V(i,j) = \max(X_L(i,j), X_R(i,j), X_S(i,j))$
Each $V(i,j)$ can be computed in $O(1)$ time. The optimal alignment with affine gap weights can be computed with a DP table of space and time complexity $O(nm)$.
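The simplified forms of $X_L$ and $X_R$ need only the neighbouring cell, which is what brings the work per cell down to $O(1)$. The following Python sketch evaluates the affine recurrence; the name `global_value_affine` and the sample weights are illustrative. One assumption differs from the base conditions as written: the sketch treats every border entry of $X_L$ and $X_R$ as undefined (negative infinity) instead of giving it a numeric value, so that a gap in one string is never extended as if it were a gap in the other.

```python
NEG_INF = float("-inf")


def global_value_affine(s1, s2, wg, ws, d):
    """Evaluate V(n, m) under affine gap weights in O(nm) time."""
    n, m = len(s1), len(s2)
    V = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    XL = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # ends with a space in S1's row
    XR = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # ends with a space in S2's row
    V[0][0] = 0
    for i in range(1, n + 1):
        V[i][0] = -wg - i * ws
    for j in range(1, m + 1):
        V[0][j] = -wg - j * ws
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            xs = V[i - 1][j - 1] + d(s1[i - 1], s2[j - 1])
            # Either extend the current gap (pay Ws) or open a new one (pay Wg + Ws).
            XL[i][j] = max(XL[i][j - 1], V[i][j - 1] - wg) - ws
            XR[i][j] = max(XR[i - 1][j], V[i - 1][j] - wg) - ws
            V[i][j] = max(xs, XL[i][j], XR[i][j])
    return V[n][m]


def d(x, y):
    """Match 2, mismatch -2."""
    return 2 if x == y else -2


# Assumed weights for illustration: Wg = 3, Ws = 1.
print(global_value_affine("acgt", "acttgt", 3, 1, d))  # prints 3
```

The printed value corresponds to four matches and a single gap of length 2 (penalty 3 + 2).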

Parallelization
Systolic arrays:
– Create a special-purpose processor $P(i,j)$ for the $(i,j)$th entry of the dynamic programming table.
– Connect $P(i,j)$ to $P(i-1,j)$, $P(i-1,j-1)$ and $P(i,j-1)$.
– Each processor holds the static data $W_g$ and $W_s$.
– Each processor stores and transmits the dynamic data $X_S(i,j)$, $X_L(i,j)$, $X_R(i,j)$ and $V(i,j)$.

Systolic Computation
Dynamically compute in one cycle:
– $X_S(i,j)$, $X_L(i,j)$, $X_R(i,j)$, $V(i,j)$
using
– $X_S(i-1,j)$, $X_L(i-1,j)$, $X_R(i-1,j)$, $V(i-1,j)$
– $X_S(i,j-1)$, $X_L(i,j-1)$, $X_R(i,j-1)$, $V(i,j-1)$
– $X_S(i-1,j-1)$, $X_L(i-1,j-1)$, $X_R(i-1,j-1)$, $V(i-1,j-1)$
and
– $W_g$ and $W_s$.
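Because each cell depends only on its upper, left and upper-left neighbours, all cells on one anti-diagonal ($i + j$ constant) are independent of each other and could be updated in the same cycle. The sketch below is a sequential software simulation of that evaluation order, not a hardware design; for brevity it uses the simple local-alignment recurrence with per-space scores (match 2, mismatch -2, space -1) rather than the affine variables, and the function name `wavefront_local_value` is hypothetical.

```python
def d(x, y):
    """Match 2, mismatch -2, space -1."""
    if x == "-" or y == "-":
        return -1
    return 2 if x == y else -2


def wavefront_local_value(s1, s2, score=d):
    """Fill the local-alignment table one anti-diagonal at a time.

    Returns (best value, number of anti-diagonal steps). Each step stands
    for one cycle in which all cells of that anti-diagonal could be
    computed simultaneously by separate processors.
    """
    n, m = len(s1), len(s2)
    v = [[0] * (m + 1) for _ in range(n + 1)]
    steps = 0
    for t in range(2, n + m + 1):              # anti-diagonal i + j = t
        lo, hi = max(1, t - m), min(n, t - 1)
        if lo > hi:
            continue
        for i in range(lo, hi + 1):            # independent cells
            j = t - i
            v[i][j] = max(
                0,
                v[i - 1][j] + score(s1[i - 1], "-"),
                v[i][j - 1] + score("-", s2[j - 1]),
                v[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]),
            )
        steps += 1
    return max(max(row) for row in v), steps


best, cycles = wavefront_local_value("pqraxabcstvq", "xyaxbacsll")
print(best, cycles)  # prints 8 21
```

For strings of lengths $n$ and $m$ (both nonzero) the table has $n + m - 1$ anti-diagonals, so the simulated cycle count grows linearly while the number of cells grows as $nm$.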

Database Search
BLAST and its relatives: a query search $\Rightarrow$
– Compare the query sequence to all the sequences in the database for local similarities.
Heuristics:
– BLAST
– FAST
Needs good complexity analysis.

BLAST
Basic Local Alignment Search Tool
– Input: a query sequence in $\Sigma^*$ and a database $L \subseteq \Sigma^*$.
– BLAST returns a list of high-scoring segment pairs between the query sequence and sequences in the database.
– The score function is based on PAM score functions.

BLAST Heuristics
BLAST is a 3-step algorithm:
– Step 1. Compile a list of high-scoring strings (words), $\mathcal{W}$: all $w$-mers that score at least a given threshold with some $w$-mer of the query.
– Step 2. Search for hits; each hit defines a seed. Construct a DFA to recognize $\mathcal{W}$, and scan the database, compiling the hits.
– Step 3. Extend the seeds. The seeds are extended in both directions until the score falls a certain distance below the best so far.
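The three steps can be illustrated with a heavily simplified seed-and-extend sketch in Python. It is not BLAST itself, and every name and parameter in it is hypothetical: the word list contains only the exact $w$-mers of the query (as if the threshold admitted identical words only), a dictionary lookup stands in for the DFA, scoring is a plain match/mismatch pair instead of a PAM matrix, and extension is ungapped with a fixed drop-off `drop`.

```python
def _extend(q, s, qi, sj, step, drop, score):
    """Extend along a diagonal from (qi, sj); return (best gain, its length)."""
    best = run = length = best_len = 0
    while 0 <= qi < len(q) and 0 <= sj < len(s):
        run += score(q[qi], s[sj])
        length += 1
        if run > best:
            best, best_len = run, length
        elif best - run > drop:
            break  # score fell more than `drop` below the best so far
        qi += step
        sj += step
    return best, best_len


def seed_and_extend(query, subject, w=3, drop=4, match=2, mismatch=-2):
    """Return (score, query start, subject start, length) tuples, best first.

    Positions are 0-based. Only identical w-mers count as hits here.
    """
    def score(x, y):
        return match if x == y else mismatch

    # Step 1 (simplified): index every w-mer of the query.
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)

    pairs = set()
    # Step 2: scan the subject; every word found in the index is a seed.
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            # Step 3: extend the seed to the left and to the right.
            gain_l, len_l = _extend(query, subject, i - 1, j - 1, -1, drop, score)
            gain_r, len_r = _extend(query, subject, i + w, j + w, +1, drop, score)
            pairs.add((w * match + gain_l + gain_r,
                       i - len_l, j - len_l, w + len_l + len_r))
    return sorted(pairs, reverse=True)


hits = seed_and_extend("ttacgtacgg", "ccacgtacaa")
print(hits[0])  # prints (12, 2, 2, 6): the shared segment "acgtac"
```

Seeds on the same diagonal extend to the same segment pair, which is why the results are collected in a set before sorting.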

FAST
$s$, $t$ = the two sequences being compared, with $|s| = m$ and $|t| = n$.
– Step 1. Determine the $k$-tuples common to both sequences, with $k$ = 1 or 2.
– Step 2. Compute the "offset" of each common $k$-tuple: if a common $k$-tuple starts at positions $s[i]$ and $t[j]$, then offset = $i - j$.
– Step 3. Determine the most common offset value to align the sequences.
– Step 4. Combine the common $k$-tuples to create a region.

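Example
s = H A R F Y A A Q I V L
t = V D M A A Q I A
Positions of the 1-tuples in s:
– A → (2, 6, 7)
– F → (4)
– H → (1)
– I → (9)
– L → (11)
– Q → (8)
– R → (3)
– V → (10)
– Y → (5)
Offsets $i - j$ for the characters of t that occur in s:
– V (j = 1): {9}
– A (j = 4): {-2, 2, 3}
– A (j = 5): {-3, 1, 2}
– Q (j = 6): {2}
– I (j = 7): {2}
– A (j = 8): {-6, -2, -1}
The most common offset is 2, which gives the alignment:
H A R F Y A A Q I V L
          | | | | +
    V D M A A Q I A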
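Steps 1 to 3 of the FAST heuristic can be reproduced on this example with a short Python sketch. The function names `ktuple_index` and `offset_counts` are illustrative; positions are 1-based to match the slide.

```python
from collections import Counter, defaultdict


def ktuple_index(s, k=1):
    """Map each k-tuple of s to the list of its 1-based start positions."""
    index = defaultdict(list)
    for i in range(len(s) - k + 1):
        index[s[i:i + k]].append(i + 1)
    return index


def offset_counts(s, t, k=1):
    """Count how often each offset i - j occurs over all common k-tuples."""
    index = ktuple_index(s, k)
    counts = Counter()
    for j in range(len(t) - k + 1):
        for i in index.get(t[j:j + k], []):
            counts[i - (j + 1)] += 1
    return counts


s, t = "HARFYAAQIVL", "VDMAAQIA"
print(ktuple_index(s)["A"])               # prints [2, 6, 7]
print(offset_counts(s, t).most_common(1)) # prints [(2, 4)]
```

Offset 2 occurs four times (for A at j = 4, A at j = 5, Q and I), so t is shifted two positions to the right relative to s.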