
4 -1 Chapter 4 The Sequence Alignment Problem

4 -2 The Longest Common Subsequence (LCS) Problem
A string: S1 = “TAGTCACG”
A subsequence of S1 is obtained by deleting 0 or more symbols from S1 (not necessarily consecutive), e.g. G, AGC, TATC, AGACG.
Common subsequences of S1 = “TAGTCACG” and S2 = “AGACTGTC”: GG, AGC, AGACG
Longest common subsequence (LCS):
S1: TAGTCACG
S2: AGACTGTC
LCS: AGACG

4 -3 Applications of LCS
The edit distance of two strings or files (number of deletions and insertions):
S1:        TAGTCAC-G--
S2:        -AG--ACTGTC
Operation: DMMDDMMIMII (D = delete, M = match, I = insert)
Spoken word recognition
Similarity of two biological sequences (DNA or protein)
Sequence alignment

4 -4 The LCS Algorithm
S1 = a_1 a_2 … a_m and S2 = b_1 b_2 … b_n
A_{i,j} denotes the length of the longest common subsequence of a_1 a_2 … a_i and b_1 b_2 … b_j.
Dynamic programming:
$$A_{i,j} = \begin{cases} A_{i-1,j-1} + 1 & \text{if } a_i = b_j \\ \max\{A_{i-1,j},\ A_{i,j-1}\} & \text{if } a_i \neq b_j \end{cases}$$
$$A_{0,0} = A_{0,j} = A_{i,0} = 0 \quad \text{for } 1 \le i \le m,\ 1 \le j \le n$$
Time complexity: O(mn)

4 -5 Using dynamic programming, we can compute matrix A starting at the upper-left corner and ending at the lower-right corner, either row by row or column by column.

4 -6 After matrix A has been found, we can trace back to find the LCS.
S1: TAGTCACG
S2: AGACTGTC
LCS: AGACG
[Figure: matrix A for S1 and S2]
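
The table fill and the traceback can be sketched in Python as follows. This is a minimal illustration, not the lecturer's code: the function names `lcs_table` and `lcs` are made up here, and the match case uses the standard LCS rule (diagonal entry plus one). Ties during the traceback are broken by moving up first, which is an arbitrary choice; other tie-breaking rules may return a different LCS of the same length.

```python
def lcs_table(s1, s2):
    """Fill A, where A[i][j] is the LCS length of s1[:i] and s2[:j]."""
    m, n = len(s1), len(s2)
    A = [[0] * (n + 1) for _ in range(m + 1)]  # row 0 and column 0 stay 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                A[i][j] = A[i - 1][j - 1] + 1          # match: extend diagonal
            else:
                A[i][j] = max(A[i - 1][j], A[i][j - 1])
    return A


def lcs(s1, s2):
    """Trace back from the lower-right corner of A to recover one LCS."""
    A = lcs_table(s1, s2)
    i, j = len(s1), len(s2)
    out = []
    while i > 0 and j > 0:
        if s1[i - 1] == s2[j - 1]:
            out.append(s1[i - 1])      # this symbol belongs to the LCS
            i, j = i - 1, j - 1
        elif A[i - 1][j] >= A[i][j - 1]:
            i -= 1                     # tie-break: prefer moving up
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs("TAGTCACG", "AGACTGTC"))
```

The table costs O(mn) time and space; the traceback adds only O(m + n) steps.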

4 -7 Edit Distance (1)
Goal: find a shortest sequence of edit operations that transforms one string into the other.
S1:        TAGTCAC-G--
S2:        -AG--ACTGTC
Operation: DMMDDMMIMII

4 -8 Edit Distance (2)
S1:        TAGTCAC-G--
S2:        -AG--ACTGTC
Operation: DMMDDMMIMII
[Figure: the edit path drawn on the matrix for S1 and S2]
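
As a rough sketch of the edit distance used on these slides (insertions and deletions only, no substitutions), the following Python function fills the usual table. The name `indel_distance` is an assumption made for this example. With only these two operations the result equals m + n - 2·LCS, which for the example strings is 8 + 8 - 2·5 = 6, the number of D and I symbols in the operation string above.

```python
def indel_distance(s1, s2):
    """Minimum number of single-symbol insertions and deletions turning s1 into s2."""
    m, n = len(s1), len(s2)
    E = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        E[i][0] = i                      # delete all of s1[:i]
    for j in range(n + 1):
        E[0][j] = j                      # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                E[i][j] = E[i - 1][j - 1]                    # match, no cost
            else:
                E[i][j] = 1 + min(E[i - 1][j], E[i][j - 1])  # delete or insert
    return E[m][n]


print(indel_distance("TAGTCACG", "AGACTGTC"))
```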

4 -9 The Longest Increasing Subsequence (LIS) Problem
Definition:
Input: one numeric sequence S
Output: the longest increasing subsequence in S
Example: Given S = [values missing in source], the LIS in S is [values missing in source].
By applying the LCS algorithm, this problem can be solved in O(n^2) time. (Why?)
The Robinson-Schensted-Knuth algorithm can solve the LIS problem in O(n log n) time. (See the example on the next page.)

4 -10 Robinson-Schensted-Knuth Algorithm for LIS
[Table: contents of list L after each input element is inserted]
LIS: 3, 5, 7, 8
Time complexity: O(n log n). The n numbers are inserted one at a time, and each insertion takes O(log n) time for binary search.
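
The binary-search insertion described above can be sketched as follows. This keeps only the first row of best tails (often called patience sorting) plus back-pointers to rebuild one LIS; it is an illustration under that simplification, not a full Robinson-Schensted-Knuth implementation. The function name `lis` and the input sequence in the usage line are hypothetical, chosen so that the answer matches the LIS shown on the slide.

```python
from bisect import bisect_left


def lis(seq):
    """Return one longest strictly increasing subsequence of seq in O(n log n)."""
    tails = []       # tails[k]: smallest tail of an increasing subsequence of length k+1
    tails_idx = []   # index in seq of the element stored in tails[k]
    prev = [-1] * len(seq)   # back-pointer to the previous element of the subsequence
    for i, x in enumerate(seq):
        k = bisect_left(tails, x)        # binary search: first tail >= x
        if k == len(tails):
            tails.append(x)              # x extends the longest subsequence so far
            tails_idx.append(i)
        else:
            tails[k] = x                 # x is a better (smaller) tail for length k+1
            tails_idx[k] = i
        prev[i] = tails_idx[k - 1] if k > 0 else -1
    out = []
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:                       # follow back-pointers from the last tail
        out.append(seq[i])
        i = prev[i]
    return out[::-1]


# Hypothetical input, not the sequence from the slide (which was not preserved).
print(lis([3, 9, 5, 2, 7, 8, 1]))
```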

4 -11 Hunt-Szymanski LCS Algorithm
By extending the idea of the RSK algorithm, the LCS problem can be solved in O(r log n) time, where r denotes the number of matches, that is, pairs (i, j) with a_i = b_j.
This algorithm is faster than traditional dynamic programming if r is small.

4 -12 The Pairs of Matching
Input sequences: TAGTCACG and AGACTGTC
Pairs of matching (i, j), where the i-th symbol of the first sequence equals the j-th symbol of the second:
T: (1,5) (1,7)
A: (2,1) (2,3)
G: (3,2) (3,6)
T: (4,5) (4,7)
C: (5,4) (5,8)
A: (6,1) (6,3)
C: (7,4) (7,8)
G: (8,2) (8,6)

4 -13 Example for Hunt-Szymanski Algorithm
The insertion order is row major and column backward:
(1,7) (1,5) (2,3) (2,1) (3,6) (3,2) (4,7) (4,5) (5,8) (5,4)
L:
1: (1,7) (1,5) (2,3) (2,1)
2: (3,6) (3,2)
3: (4,7) (4,5) (5,4)
4: (5,8)
Exercise: fill in the remaining entries yourself.
Time complexity: O(r log n), where r is the number of matches. Each match needs O(log n) time for binary search.
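
A compact sketch of this idea in Python, computing only the LCS length. The name `hs_lcs_length` and the list `thresh` are assumptions made for this example; `thresh[k]` plays the role of position k+1 of list L and stores the smallest column j at which a common subsequence of length k+1 can end. Matches are processed row by row with columns in decreasing order, as on the slide, so that two matches from the same row never chain together.

```python
from bisect import bisect_left
from collections import defaultdict


def hs_lcs_length(s1, s2):
    """LCS length in O(r log n) after building the match lists (r = number of matches)."""
    positions = defaultdict(list)            # symbol -> increasing positions in s2
    for j, ch in enumerate(s2):
        positions[ch].append(j)
    thresh = []                               # strictly increasing column indices
    for ch in s1:                             # row-major order
        for j in reversed(positions.get(ch, [])):   # columns backward
            k = bisect_left(thresh, j)        # binary search for the slot of j
            if k == len(thresh):
                thresh.append(j)              # a longer common subsequence was found
            else:
                thresh[k] = j                 # same length, but ending earlier in s2
    return len(thresh)


print(hs_lcs_length("TAGTCACG", "AGACTGTC"))
```

For the two example strings there are 16 matches, so 16 binary searches are performed instead of filling all 64 cells of the table.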

4 -14 The Longest Common Increasing Subsequence (LCIS) Problem
Definition:
Input: two numeric sequences S1, S2
Output: the longest common increasing subsequence of S1 and S2
Example: Given S1 = [values missing in source] and S2 = [values missing in source], the LCIS of S1 and S2 is 2, 4, 6.
This problem can be solved by applying the RSK algorithm on the table used for finding the LCS (Chao’s algorithm). (See the example on the next page.)

4 -15 Chao’s Algorithm for LCIS
[Table: for each cell of the LCS table, the best-tail lists L1, L2, L3 found so far. The last cell holds L1: 1, L2: 4, L3: 6.]

4 -16 Analysis for Chao’s Algorithm
There are two types of operations to update the best tails: insert (on a match) and merge (on a mismatch).
A direct implementation takes O(n^3) time, since each operation costs O(n).
However, it can be shown that each merge can be done in constant time, and that all insertions in a row take O(n) time in total.
Thus, this is an O(n^2) algorithm.
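
For comparison, the same O(n^2) bound can be reached with a simpler, well-known dynamic program that does not keep best-tail lists. The sketch below is that textbook alternative, not Chao's algorithm; the name `lcis_length` and the input sequences are hypothetical (the sequences on the slide were not preserved), chosen so that the LCIS is 2, 4, 6.

```python
def lcis_length(a, b):
    """Length of the longest common strictly increasing subsequence of a and b."""
    best = [0] * len(b)   # best[j]: length of the best LCIS that ends with b[j]
    for x in a:
        cur = 0           # best LCIS length ending in some b[j'] < x with j' < j
        for j, y in enumerate(b):
            if y == x and cur + 1 > best[j]:
                best[j] = cur + 1          # append x after the best smaller tail
            elif y < x and best[j] > cur:
                cur = best[j]              # b[j] can precede x
    return max(best, default=0)


# Hypothetical example sequences.
print(lcis_length([3, 2, 4, 1, 6], [2, 7, 4, 6, 3]))
```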

4 -17 The Constrained Longest Common Subsequence (CLCS) Problem
Definition:
Input: two sequences S1, S2, and a constraint sequence C
Output: the longest common subsequence of S1 and S2 that contains C as a subsequence
Example: Given S1 = TAGTCACG, S2 = AGACTGTC and C = AT, the CLCS of S1 and S2 is AGTG. (The unconstrained LCS is AGACG.)
Purpose: from a biological perspective, we can specify the functional sites in the input sequences by setting proper constraints.

4 -18 The CLCS Algorithm
S1 = a_1 a_2 … a_m, S2 = b_1 b_2 … b_n and C = c_1 c_2 … c_r
R_{k,i,j} denotes the length of the longest common subsequence of a_1 a_2 … a_i and b_1 b_2 … b_j that contains c_1 c_2 … c_k as a subsequence.
Dynamic programming:
$$R_{k,i,j} = \begin{cases} R_{k-1,i-1,j-1} + 1 & \text{if } c_k = a_i = b_j \\ R_{k,i-1,j-1} + 1 & \text{if } c_k \neq a_i = b_j \\ \max\{R_{k,i-1,j},\ R_{k,i,j-1}\} & \text{if } a_i \neq b_j \end{cases}$$
$$R_{k,0,0} = R_{k,i,0} = R_{k,0,j} = -\infty \quad \text{for } 1 \le k \le r,\ 1 \le i \le m,\ 1 \le j \le n$$
R_{0,i,j} = A_{i,j} (the LCS without constraint; see the previous pages)
Time complexity: O(rnm)
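
A direct Python sketch of this three-dimensional table, assuming the usual form of the recurrence in which a match adds one to the diagonal entry (and also steps back one level in k when the matched symbol equals c_k). The function name `clcs_length` is made up for this example. `float("-inf")` stands for the -∞ boundary value, so an unsatisfiable constraint yields -inf.

```python
def clcs_length(s1, s2, c):
    """Length of the longest common subsequence of s1 and s2 containing c."""
    NEG = float("-inf")
    m, n, r = len(s1), len(s2), len(c)
    # R[k][i][j] for prefixes c[:k], s1[:i], s2[:j]; layer k = 0 is the plain LCS table.
    R = [[[0] * (n + 1) for _ in range(m + 1)] for _ in range(r + 1)]
    for k in range(1, r + 1):
        for i in range(m + 1):
            R[k][i][0] = NEG               # empty s2 prefix cannot contain c[:k]
        for j in range(n + 1):
            R[k][0][j] = NEG               # empty s1 prefix cannot contain c[:k]
    for k in range(r + 1):
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if s1[i - 1] == s2[j - 1]:
                    if k > 0 and c[k - 1] == s1[i - 1]:
                        R[k][i][j] = R[k - 1][i - 1][j - 1] + 1   # match consumes c_k
                    else:
                        R[k][i][j] = R[k][i - 1][j - 1] + 1       # ordinary match
                else:
                    R[k][i][j] = max(R[k][i - 1][j], R[k][i][j - 1])
    return R[r][m][n]


print(clcs_length("TAGTCACG", "AGACTGTC", "AT"))
```

With an empty constraint the function reduces to the ordinary LCS length.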

4 -19 Example for CLCS Algorithm
Input: S1 = TAGTCACG, S2 = AGACTGTC and C = AT
(X means -∞)
[Tables for k = 0 and k = 1 (constraint A): rows labeled by S1, columns by S2; values not preserved]
k = 2 (constraint T):
     -  A  G  A  C  T  G  T  C
  -  X  X  X  X  X  X  X  X  X
  T  X  X  X  X  X  X  X  X  X
  A  X  X  X  X  X  X  X  X  X
  G  X  X  X  X  X  X  X  X  X
  T  X  X  X  X  X  3  3  3  3
  C  X  X  X  X  X  3  3  3  4
  A  X  X  X  X  X  3  3  3  4
  C  X  X  X  X  X  3  3  3  4
  G  X  X  X  X  X  3  4  4  4
Following the links, we can obtain the CLCS of S1 and S2 with constraint C: AGTG

4 -20 Sequence Alignment
S1 = TAGTCACG
S2 = AGACTGTC
Two possible alignments:
----TAGTCACG      TAGTCAC-G--
AGACT-GTC---      -AG--ACTGTC
Which one is better? We can set different gap penalties as parameters for different purposes.

4 -21 Sequence Alignment Problem
Definition:
Input: two (or more) sequences S1, S2, …, Sn, and a scoring function f
Output: the alignment of S1, S2, …, Sn that has the optimal score
Purpose:
To determine how close two species are
To perform data compression
To determine the common area of some sequences
To construct evolutionary trees

4 -22 Gap Penalty
[Equation missing in source] is the gap penalty.
Suppose [scoring values missing in source]

4 -23 Example for Sequence Alignment
TAGTCAC-G--
-AG--ACTGTC

4 -24 PAM250 Score Matrix

4 -25 Blosum62 Score Matrix

4 -26 The Local Alignment Problem
Input: two (or more) sequences S1, S2, …, Sn, and a scoring function f
Output: substrings Si’ of Si (1 ≤ i ≤ n) such that the score obtained by aligning the Si’ is the highest among all possible substrings of the Si
Aligning the whole strings:
S1 = abbbcc
S2 = adddcc
Score = 3 × 2 + 3 × (-1) = 3
Aligning only the substrings:
S1’ = cc
S2’ = cc
Score = 2 × 2 = 4

4 -27 Dynamic Programming for Local Alignment Once the score becomes negative, we reset it to 0.
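
The reset-to-zero rule can be sketched in Python as below (this is the Smith-Waterman scheme). The scoring values are assumptions: match = 2 and mismatch = -1 follow the abbbcc/adddcc example on the previous slide, while gap = -1 is a guess, since the slide's own formula was not preserved. The function name `local_alignment_score` is hypothetical.

```python
def local_alignment_score(s1, s2, match=2, mismatch=-1, gap=-1):
    """Best score over all pairs of substrings of s1 and s2."""
    m, n = len(s1), len(s2)
    H = [[0] * (n + 1) for _ in range(m + 1)]   # H[i][j]: best score ending at (i, j)
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            H[i][j] = max(
                0,                        # reset: start a new local alignment here
                H[i - 1][j - 1] + sub,    # align s1[i-1] with s2[j-1]
                H[i - 1][j] + gap,        # s1[i-1] against a space
                H[i][j - 1] + gap,        # s2[j-1] against a space
            )
            best = max(best, H[i][j])
    return best


print(local_alignment_score("abbbcc", "adddcc"))
```

The best local alignment ends at the cell holding the maximum; tracing back from that cell until a 0 is reached recovers the aligned substrings.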

4 -28 Example for Local Alignment
Two solutions:
AGTCAC-G      TAGTC
AG--ACTG      T-GTC

4 -29 The Affine Gap Penalty
S1 = ACTTGATCC
S2 = AGTTAGTAGTCC
An optimal alignment:
S1 = ACTT-G-A-TCC
S2 = AGTTAGTAGTCC
Original score = 12
The following alignment may be better because there is only one gap:
S1 = ACTT---GATCC
S2 = AGTTAGTAGTCC
Original score = 6

4 -30 Definition of Affine Gap Penalty
A gap is caused by a mutational event that removes a run of residues.
One long gap is often preferable to several short gaps.
An affine gap penalty is defined as P_g + k·P_e for a gap with k spaces (k ≥ 1), where P_g, P_e ≥ 0.
P_g is the cost of opening a gap and P_e is the cost of each space in the gap.

4 -31 Suppose that P_g = 4 and P_e = 1.
S1 = ACTTGATCC
S2 = AGTTAGTAGTCC
First alignment:
S1 = ACTT-G-A-TCC
S2 = AGTTAGTAGTCC
Score = 8 × 2 - 1 × 1 - 3 × (4 + 1 × 1) = 0
Second alignment:
S1 = ACTT---GATCC
S2 = AGTTAGTAGTCC
Score = 6 × 2 - 3 × 1 - (4 + 3 × 1) = 2
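
The two scores above can be checked with a small Python helper that scores an alignment that is already given. The name `affine_score` is hypothetical; match = 2 and mismatch = -1 are read off the arithmetic on the slide, and each maximal run of spaces in one row is charged P_g once plus P_e per space.

```python
def affine_score(row1, row2, match=2, mismatch=-1, pg=4, pe=1):
    """Score two aligned rows of equal length under an affine gap penalty."""
    score = 0
    open_gap = None                      # which row holds the gap currently open
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            side = 1 if x == "-" else 2
            if side != open_gap:
                score -= pg              # a new gap starts: pay the opening cost
            score -= pe                  # every space pays the extension cost
            open_gap = side
        else:
            score += match if x == y else mismatch
            open_gap = None              # an aligned pair closes any open gap
    return score


print(affine_score("ACTT-G-A-TCC", "AGTTAGTAGTCC"))   # three gaps of length 1
print(affine_score("ACTT---GATCC", "AGTTAGTAGTCC"))   # one gap of length 3
```

Setting `pg=0` gives back the original linear-gap scores of 12 and 6, so the ranking of the two alignments flips once gap opening is charged.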

4 -32 Algorithm for Affine Gap Penalty
A(i,j) is the score of the optimal alignment of a_1 a_2 … a_i and b_1 b_2 … b_j.
A_1(i,j) is the best score when a_i is aligned with b_j.
A_2(i,j) is the best score when a_i is aligned with a space (-).
A_3(i,j) is the best score when a space (-) is aligned with b_j.
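
Since the recurrences for A_1, A_2 and A_3 did not survive in this transcript, the sketch below uses the standard three-table formulation (usually attributed to Gotoh), which may differ in detail from the slide. The function name `affine_alignment_score` and the default scores (match = 2, mismatch = -1, P_g = 4, P_e = 1) are assumptions taken from the preceding example.

```python
def affine_alignment_score(s1, s2, match=2, mismatch=-1, pg=4, pe=1):
    """Optimal global alignment score with gap cost pg + k*pe for k spaces."""
    NEG = float("-inf")
    m, n = len(s1), len(s2)
    A1 = [[NEG] * (n + 1) for _ in range(m + 1)]   # a_i aligned with b_j
    A2 = [[NEG] * (n + 1) for _ in range(m + 1)]   # a_i aligned with a space
    A3 = [[NEG] * (n + 1) for _ in range(m + 1)]   # a space aligned with b_j
    A1[0][0] = 0                                   # empty alignment
    for i in range(1, m + 1):
        A2[i][0] = -(pg + i * pe)                  # one leading gap of i spaces
    for j in range(1, n + 1):
        A3[0][j] = -(pg + j * pe)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            A1[i][j] = sub + max(A1[i - 1][j - 1], A2[i - 1][j - 1], A3[i - 1][j - 1])
            A2[i][j] = max(A1[i - 1][j] - pg - pe,   # open a new gap
                           A2[i - 1][j] - pe,        # extend the current gap
                           A3[i - 1][j] - pg - pe)   # switch gap to the other row
            A3[i][j] = max(A1[i][j - 1] - pg - pe,
                           A3[i][j - 1] - pe,
                           A2[i][j - 1] - pg - pe)
    return max(A1[m][n], A2[m][n], A3[m][n])


print(affine_alignment_score("ACTTGATCC", "AGTTAGTAGTCC"))
```

The running time stays O(mn) because each cell of the three tables is computed from a constant number of neighbours.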

4 -33 Multiple Sequence Alignment (MSA)
Suppose three sequences are involved:
S1 = ATTCGAT
S2 = TTGAG
S3 = ATGCT
A very good alignment:
S1 = ATTCGAT
S2 = -TT-GAG
S3 = AT--GCT
In fact, the alignment it induces between every pair of sequences is also good.

4 -34 Complexity of MSA
2-sequence alignment problem: time complexity O(n^2)
3-sequence alignment problem: a score over triples (x,y,z) has to be defined; time complexity O(n^3)
k-sequence alignment problem: O(n^k)

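4 -35 The Star Algorithm for MSA
Proposed by Gusfield
An approximation algorithm for the sum-of-pairs multiple sequence alignment problem
Let δ(x,y) = 0 if x = y and δ(x,y) = 1 if x ≠ y.
S1 = GCCAT        S1 = GCCAT
S2 = G--AT        S2 = GA--T
distance = 2      distance = 3
The distance induced by an alignment is defined as the sum of δ over its columns:
$$d(S_i, S_j) = \sum_{x=1}^{\ell} \delta\big(S_i'(x),\ S_j'(x)\big)$$
where S_i' and S_j' are the aligned rows and ℓ is their common length.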
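
As a small illustration, the induced distance is just a count of the columns whose two symbols differ. The function name `induced_distance` is made up for this sketch; a space is written as "-" and compared like any other symbol, so a column holding two spaces costs 0.

```python
def induced_distance(row1, row2):
    """Number of columns in which two aligned rows differ (delta summed over columns)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    return sum(1 for x, y in zip(row1, row2) if x != y)


print(induced_distance("GCCAT", "G--AT"))   # first alignment on the slide
print(induced_distance("GCCAT", "GA--T"))   # second alignment on the slide
```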

4 -36 Properties of d(Si,Sj):
d(Si,Si) = 0
Triangle inequality: d(Si,Sj) + d(Si,Sk) ≥ d(Sj,Sk)
Given two sequences Si and Sj, the minimum distance over all their alignments is denoted D(Si,Sj), so D(Si,Sj) ≤ d(Si,Sj).
[Figure: triangle on i, j, k illustrating the distances]

4 -37 Example for the Star Algorithm
S1 = ATGCTC
S2 = AGAGC
S3 = TTCTG
S4 = ATTGCATGC
Try to align every pair of sequences:
S1 = ATGCTC
S2 = A-GAGC
D(S1,S2) = 3
S1 = ATGCTC
S3 = TT-CTG
D(S1,S3) = 3

4 -38
S1 = AT-GC-T-C
S4 = ATTGCATGC
D(S1,S4) = 3
S2 = A--G-A-GC
S4 = ATTGCATGC
D(S2,S4) = 4
S2 = AGAGC
S3 = TTCTG
D(S2,S3) = 5
S3 = -TT-C-TG-
S4 = ATTGCATGC
D(S3,S4) = 4

4 -39 Given a set S of k sequences, the center of this set is the sequence S_c that minimizes
$$\sum_{j \neq c} D(S_c, S_j)$$
D(S1,S2) + D(S1,S3) + D(S1,S4) = 9
D(S2,S1) + D(S2,S3) + D(S2,S4) = 12
D(S3,S1) + D(S3,S2) + D(S3,S4) = 12
D(S4,S1) + D(S4,S2) + D(S4,S3) = 11
S1 is selected as the center since S1 is the most similar to the others.
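
The center selection can be sketched as a row-sum minimization over the pairwise distance matrix. The function name `choose_center` is hypothetical, and the matrix below simply copies the D values listed on the preceding slides (index 0 is S1, index 3 is S4) rather than recomputing them.

```python
def choose_center(D):
    """Return (index of the center, list of row sums) for a symmetric distance matrix."""
    totals = [sum(row) for row in D]             # diagonal entries are 0
    center = min(range(len(D)), key=totals.__getitem__)
    return center, totals


# Pairwise optimal distances taken from the example.
D = [
    [0, 3, 3, 3],   # S1
    [3, 0, 5, 4],   # S2
    [3, 5, 0, 4],   # S3
    [3, 4, 4, 0],   # S4
]
print(choose_center(D))
```

If several sequences tie for the smallest sum, this sketch picks the one with the lowest index.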

4 -40 S1 has been selected as the center.
Align S2 with S1:
S1 = ATGCTC
S2 = A-GAGC
Add S3 by aligning S3 with S1:
S1 = ATGCTC
S2 = A-GAGC
S3 = -TTCTG
Add S4 by aligning S4 with S1:
S1 = AT-GC-T-C
S2 = A--GA-G-C
S3 = -T-TC-T-G
S4 = ATTGCATGC

4 -41 Approximation Rate
App ≤ 2·Opt (See the proof in the lecture notes.)

4 -42 The MST Preservation for MSA
In Gusfield’s star algorithm, the alignments between the center and all other sequences are optimal. Thus, (k-1) distances are preserved.
MST preservation instead preserves the distances on the edges of the minimal spanning tree.
D: distance matrix based on optimal alignments between every pair of input sequences
Dm: distance matrix based on a multiple sequence alignment
MST(D): MST based on D
MST(Dm): MST based on Dm
Goal: MST(D) = MST(Dm)

4 -43 Example for MST Preservation
Input:
S1 = ATGCTC
S2 = ATGAGC
S3 = TTCTG
S4 = ATGCATGC
Step 1: Find the pairwise distances optimally by the dynamic programming algorithm.
S1 = ATGCTC
S2 = ATGAGC
D(S1,S2) = 2
S1 = ATGCTC
S3 = TT-CTG
D(S1,S3) = 3

4 -44
S1 = ATGC-T-C
S4 = ATGCATGC
D(S1,S4) = 2
S2 = ATG-A-GC
S4 = ATGCATGC
D(S2,S4) = 2
S2 = ATGAGC
S3 = TTCTG-
D(S2,S3) = 4
S3 = -TTC-TG-
S4 = ATGCATGC
D(S3,S4) = 4
Distance matrix D:
     S1  S2  S3  S4
S1    0   2   3   2
S2    2   0   4   2
S3    3   4   0   4
S4    2   2   4   0

4 -45 Step 2: Find the minimal spanning tree based on matrix D.
[Figure: minimal spanning tree over S1, S2, S3, S4 with edge weights]
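
Step 2 can be sketched with Prim's algorithm on the distance matrix. The function name `prim_mst` is hypothetical, and the matrix holds the pairwise D values from Step 1 (index 0 is S1). Because D(S1,S2), D(S1,S4) and D(S2,S4) are all 2, more than one minimal spanning tree exists, so the edges returned here may differ from the tree drawn on the slide while having the same total weight.

```python
def prim_mst(D):
    """Return the MST edges as (weight, u, v) tuples for a symmetric matrix D."""
    n = len(D)
    in_tree = [False] * n
    in_tree[0] = True                      # grow the tree from vertex 0
    edges = []
    for _ in range(n - 1):
        best = None
        for u in range(n):
            if not in_tree[u]:
                continue
            for v in range(n):
                if in_tree[v]:
                    continue
                if best is None or D[u][v] < best[0]:
                    best = (D[u][v], u, v)  # cheapest edge leaving the tree so far
        in_tree[best[2]] = True
        edges.append(best)
    return edges


D = [
    [0, 2, 3, 2],   # S1
    [2, 0, 4, 2],   # S2
    [3, 4, 0, 4],   # S3
    [2, 2, 4, 0],   # S4
]
mst = prim_mst(D)
print(mst, sum(w for w, _, _ in mst))
```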

4 -46 Step 3: Optimally align the pairs of sequences corresponding to the edges of the MST.
For e(S1,S2):
S1 = ATGCTC
S2 = ATGAGC
For e(S2,S4):
S1 = ATG-C-TC
S2 = ATG-A-GC
S4 = ATGCATGC
For e(S1,S3):
S1 = ATG-C-TC
S2 = ATG-A-GC
S3 = TT--C-TG
S4 = ATGCATGC
Step 4: Output the above as the final alignment.
[Figure: minimal spanning tree over S1, S2, S3, S4 with edge weights]

4 -47 MST Preservation
Distance matrix Dm and the minimal spanning tree based on Dm:
[Figure: distance matrix Dm and its minimal spanning tree over S1, S2, S3, S4]
Theorem: MST(D) is equal to MST(Dm).