Chap 4 The Sequence Alignment Problem


1 Chap 4 The Sequence Alignment Problem

2 The Sequence Alignment Problem
Introduction
– What, Who, Where, Why, When, How
The Sequence Alignment Problem
The Local Alignment Problem
The Affine Gap Penalty

3 Introduction
What
– Input: Two (or more) sequences S_1, S_2, …, S_n, and a scoring function f.
– Output: The alignment of S_1, S_2, …, S_n that has the optimal score.
Who
– Biologists want to know the secrets of DNA sequences.
– Computer scientists take it as an interesting problem.

4 Introduction (Cont'd)
Where
– Bioinformatics.
Why
– To determine how close two species are.
– Data compression.
When
– Constructing evolutionary trees.
How
– This is why we are here.

5 The Sequence Alignment Problem
S_1 = GAACTG, S_2 = GAGCTG
A scoring function f is
– +2 if S_1[i] is aligned with S_2[j] and S_1[i] = S_2[j]
– -1 otherwise.
First alignment:
  GAACTG---
  GA---GCTG
  Score = 3 x (+2) + 6 x (-1) = 0
Second alignment:
  GAACTG
  GAGCTG
  Score = 5 x (+2) + 1 x (-1) = 9
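The two scores on slide 5 can be checked mechanically. The following is a minimal sketch, assuming an alignment is given as two equal-length rows with '-' marking a space; the function name `alignment_score` and its parameters are illustrative and do not come from the slides.

```python
def alignment_score(row1: str, row2: str, match: int = 2, other: int = -1) -> int:
    """Score two aligned rows under the slide-5 scoring function."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row1, row2):
        # +2 only when two identical residues face each other;
        # a mismatch or a residue facing a space scores -1.
        score += match if a == b and a != "-" else other
    return score


print(alignment_score("GAACTG---", "GA---GCTG"))  # 0
print(alignment_score("GAACTG", "GAGCTG"))        # 9
```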

6 The Dynamic Programming Approach

7 The Dynamic Programming Approach (Cont'd)
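The recurrence shown on slides 6 and 7 did not survive in this transcript. As a stand-in, here is a minimal sketch of the standard dynamic programming table for global alignment, assuming the slide-5 scoring function (+2 for a match, -1 for a mismatch or a space). The name `global_alignment_score` and the table name `A` are assumptions, and this may differ in detail from the slides' own formulation.

```python
def global_alignment_score(s1: str, s2: str, match: int = 2, other: int = -1) -> int:
    """Best global alignment score of s1 and s2 under a linear scoring function."""
    n, m = len(s1), len(s2)
    # A[i][j] = best score for aligning the prefixes s1[:i] and s2[:j].
    A = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        A[i][0] = i * other  # s1[:i] aligned against spaces only
    for j in range(1, m + 1):
        A[0][j] = j * other  # s2[:j] aligned against spaces only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = A[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else other)
            up = A[i - 1][j] + other    # s1[i-1] faces a space
            left = A[i][j - 1] + other  # s2[j-1] faces a space
            A[i][j] = max(diag, up, left)
    return A[n][m]


print(global_alignment_score("GAACTG", "GAGCTG"))  # 9
```

The table has (n+1) x (m+1) cells and each cell is filled in constant time, so the sketch runs in O(nm) time.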

8 The Local Alignment Problem
Input: Two (or more) sequences S_1, S_2, …, S_n, and a scoring function f.
Output: Contiguous subsequences (substrings) S_i' of S_i, 1 <= i <= n, such that the score obtained by aligning the S_i' is the highest among all possible choices of subsequences of the S_i.
Global alignment:
  S_1 = abbbcc
  S_2 = adddcc
  Score = 3x2 + 3x(-1) = 3
Local alignment:
  S_1' = cc
  S_2' = cc
  Score = 2x2 = 4

9 The Local Alignment Problem (Cont'd)
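The content of slide 9 is missing from the transcript. A common way to solve the local alignment problem for two sequences is the Smith-Waterman variant of the dynamic programming table, in which every cell is clamped at 0 so that an alignment may start anywhere. The sketch below assumes that technique and the slide-5 scoring function; the name `local_alignment_score` is illustrative, and the slides may have presented the recurrence differently.

```python
def local_alignment_score(s1: str, s2: str, match: int = 2, other: int = -1) -> int:
    """Best score over all pairs of substrings of s1 and s2 (Smith-Waterman style)."""
    n, m = len(s1), len(s2)
    # H[i][j] = best score of an alignment ending at s1[i-1] and s2[j-1], or 0.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else other)
            # The 0 lets a new local alignment start at this cell.
            H[i][j] = max(0, diag, H[i - 1][j] + other, H[i][j - 1] + other)
            best = max(best, H[i][j])
    return best


print(local_alignment_score("abbbcc", "adddcc"))  # 4
```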

10 The Affine Gap Penalty
Consider the following two sequences.
– S_1 = ACTTGATCC
– S_2 = AGTTAGTAGTCC
An optimal alignment of this pair of sequences is as follows.
– S_1 = ACTT-G-A-TCC
– S_2 = AGTTAGTAGTCC
Original score = 8x2 - 1 - 3 = 12
An alignment that takes gaps into account is as follows.
– S_1 = ACTT---GATCC
– S_2 = AGTTAGTAGTCC
Original score = 6x2 - 3 - 3 = 6

11 The Affine Gap Penalty (Cont'd)
A gap is caused by a mutational event that removed a sequence of residues.
A single mutational event is more likely than several events.
Therefore one long gap is often preferable to several short gaps.
An affine gap penalty is defined as P_g + k P_e for a gap with k spaces, k >= 1, where P_g, P_e >= 0.

12 The Affine Gap Penalty (Cont'd)
Using our previous scoring function, further let P_g = 4 and P_e = 1.
– S_1 = ACTT-G-A-TCC
– S_2 = AGTTAGTAGTCC
– Score = 8x2 - 1 - 3x(4+1x1) = 16 - 1 - 15 = 0
– S_1 = ACTT---GATCC
– S_2 = AGTTAGTAGTCC
– Score = 6x2 - 3x1 - (4+3x1) = 12 - 3 - 7 = 2
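The arithmetic on slide 12 can be reproduced by scoring a given alignment with the affine penalty P_g + k P_e per gap. This is a minimal sketch under the assumption that a gap is a maximal run of spaces in one row; the function name `affine_alignment_score` and its parameter names are illustrative only.

```python
def affine_alignment_score(row1: str, row2: str, match: int = 2, mismatch: int = -1,
                           p_g: int = 4, p_e: int = 1) -> int:
    """Score an alignment, charging p_g once per gap and p_e per space."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    score = 0
    gap_row = None  # row (1 or 2) holding the currently open gap, if any
    for a, b in zip(row1, row2):
        if a == "-" and b == "-":
            raise ValueError("a column cannot hold two spaces")
        if a == "-" or b == "-":
            row = 1 if a == "-" else 2
            if row != gap_row:
                score -= p_g  # opening penalty, charged once per gap
                gap_row = row
            score -= p_e      # extension penalty, charged for every space
        else:
            score += match if a == b else mismatch
            gap_row = None    # a residue pair closes any open gap
    return score


print(affine_alignment_score("ACTT-G-A-TCC", "AGTTAGTAGTCC"))  # 0
print(affine_alignment_score("ACTT---GATCC", "AGTTAGTAGTCC"))  # 2
```

Under the affine penalty the alignment with one gap of three spaces now outscores the one with three single-space gaps, which is the preference slide 11 argues for.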

13 The Multiple Sequence Alignment Problem
Consider the following case, where three sequences are involved.
S_1 = ATTCGAT
S_2 = TTGAG
S_3 = ATGCT

14 In the two-sequence alignment problem.
In the three-sequence alignment problem.

15 A very good alignment of these three sequences is shown as follows.
S_1 = ATTCGAT
S_2 = -TT-GAG
S_3 = AT--GCT
Note that the alignment between every pair of sequences is quite good.

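16 The Gusfield Approximation Algorithm for the Sum of Pairs Multiple Sequence Alignment Problem
We define \delta(x, y) = 0 if x = y, and \delta(x, y) = 1 if x \ne y.
The distance between two sequences S_i and S_j induced by an alignment of length l, with aligned rows S_i' and S_j', is defined as
d(S_i, S_j) = \sum_{t=1}^{l} \delta(S_i'(t), S_j'(t))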

17 d(S_i, S_j) has the following characteristics:
(1) d(S_i, S_i) = 0
(2) d(S_i, S_j) + d(S_i, S_k) >= d(S_j, S_k)
Given two sequences S_i and S_j, the minimum induced distance is denoted as D(S_i, S_j).

18 S_1 = ATGCTC
S_2 = AGAGC
S_3 = TTCTG
S_4 = ATTGCATGC
We align the four sequences in pairs.
  S_1 = ATGCTC
  S_2 = A-GAGC
  D(S_1, S_2) = 3
  S_1 = ATGCTC
  S_3 = TT-CTG
  D(S_1, S_3) = 3

19 S_1 = AT-GC-T-C
  S_4 = ATTGCATGC
  D(S_1, S_4) = 3
  S_2 = AGAGC
  S_3 = TTCTG
  D(S_2, S_3) = 5
  S_2 = A--G-A-GC
  S_4 = ATTGCATGC
  D(S_2, S_4) = 4

20 S_3 = -TT-C-TG-
  S_4 = ATTGCATGC
  D(S_3, S_4) = 4
D(S_1, S_2) + D(S_1, S_3) + D(S_1, S_4) = 9
D(S_2, S_1) + D(S_2, S_3) + D(S_2, S_4) = 12
D(S_3, S_1) + D(S_3, S_2) + D(S_3, S_4) = 12
D(S_4, S_1) + D(S_4, S_2) + D(S_4, S_3) = 11
Given a set S of k sequences, the center of this set is the sequence S_i that minimizes \sum_{X \in S \setminus \{S_i\}} D(S_i, X). Here S_1, with sum 9, is the center.
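The pairwise distances and the choice of center on slides 18 to 20 can be recomputed. The sketch below assumes that D is the unit-cost edit distance (each mismatch and each space costs 1), which agrees with every value listed on those slides; the names `edit_distance` and `center_index` are illustrative.

```python
def edit_distance(s: str, t: str) -> int:
    """Unit-cost edit distance, computed row by row."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j - 1] + (a != b),  # match or mismatch
                           prev[j] + 1,             # a faces a space
                           cur[j - 1] + 1))         # b faces a space
        prev = cur
    return prev[-1]


def center_index(seqs: list) -> tuple:
    """Return (index of the center, list of distance sums), using D(s, s) = 0."""
    sums = [sum(edit_distance(s, t) for t in seqs) for s in seqs]
    return min(range(len(seqs)), key=sums.__getitem__), sums


seqs = ["ATGCTC", "AGAGC", "TTCTG", "ATTGCATGC"]
print(center_index(seqs))  # (0, [9, 12, 12, 11])
```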

21 Align S_2 with S_1
  S_1 = ATGCTC
  S_2 = A-GAGC
Add S_3 by aligning S_3 with S_1
  S_1 = ATGCTC
  S_3 = -TTCTG
=>
  S_1 = ATGCTC
  S_2 = A-GAGC
  S_3 = -TTCTG

22 Add S_4 by aligning S_4 with S_1
  S_1 = AT-GC-T-C
  S_4 = ATTGCATGC
=>
  S_1 = AT-GC-T-C
  S_2 = A--GA-G-C
  S_3 = -T-TC-T-G
  S_4 = ATTGCATGC
App <= 2 Opt.
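The merging step on slides 21 and 22 keeps every space that has already been placed in the center sequence and pads the other rows whenever a new pairwise alignment needs an extra space in the center. The following is a minimal sketch of that step, assuming the pairwise alignments are supplied as given on the slides; the function name `add_to_star` is hypothetical.

```python
def add_to_star(msa: list, center_row: str, other_row: str) -> list:
    """Merge a pairwise alignment (center_row, other_row) into msa.

    msa[0] must be the center sequence as aligned so far.
    """
    old_center = msa[0]
    merged = [[] for _ in msa]
    new_row = []
    i = j = 0
    while i < len(old_center) or j < len(center_row):
        a = old_center[i] if i < len(old_center) else None
        b = center_row[j] if j < len(center_row) else None
        if a == "-" and b != "-":
            # Space already present in the center: keep the column
            # and pad the newly added sequence.
            for out, row in zip(merged, msa):
                out.append(row[i])
            new_row.append("-")
            i += 1
        elif b == "-" and a != "-":
            # The new pair needs a space in the center: pad all rows so far.
            for out in merged:
                out.append("-")
            new_row.append(other_row[j])
            j += 1
        else:
            # Same center symbol on both sides: copy the column as is.
            for out, row in zip(merged, msa):
                out.append(row[i])
            new_row.append(other_row[j])
            i += 1
            j += 1
    return ["".join(r) for r in merged] + ["".join(new_row)]


msa = ["ATGCTC", "A-GAGC"]                        # S_1 aligned with S_2
msa = add_to_star(msa, "ATGCTC", "-TTCTG")        # add S_3
msa = add_to_star(msa, "AT-GC-T-C", "ATTGCATGC")  # add S_4
print(msa)  # ['AT-GC-T-C', 'A--GA-G-C', '-T-TC-T-G', 'ATTGCATGC']
```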

23 The Minimal Spanning Tree Preservation Approach for Multiple Sequence Alignment
S_1 = ATGCTC
S_2 = ATGAGC
S_3 = TTCTG
S_4 = ATGCATGC
Step 1 finds the pairwise distances optimally by the dynamic programming algorithm.
  S_1 = ATGCTC
  S_2 = ATGAGC
  D(S_1, S_2) = 2

24 S_1 = ATGCTC
  S_3 = TT-CTG
  D(S_1, S_3) = 3
  S_1 = ATGC-T-C
  S_4 = ATGCATGC
  D(S_1, S_4) = 2
  S_2 = ATGAGC
  S_3 = TTCTG-
  D(S_2, S_3) = 4

25 S_2 = ATG-A-GC
  S_4 = ATGCATGC
  D(S_2, S_4) = 2
  S_3 = -TTC-TG-
  S_4 = ATGCATGC
  D(S_3, S_4) = 4
Table: The Distance Matrix D
       S_1  S_2  S_3  S_4
  S_1   0    2    3    2
  S_2   2    0    4    2
  S_3   3    4    0    4
  S_4   2    2    4    0

26 A minimal spanning tree MST(D): edges e(S_1, S_2) = 2, e(S_2, S_4) = 2, e(S_1, S_3) = 3.
For e(S_1, S_2)
  S_1 = ATGCTC
  S_2 = ATGAGC
For e(S_2, S_4)
  S_1 = (ATG-C-TC)
  S_2 = ATG-A-GC
  S_4 = ATGCATGC

27 For e(S_1, S_3)
  S_1 = ATG-C-TC
  S_2 = (ATG-A-GC)
  S_3 = TT--C-TG
  S_4 = (ATGCATGC)
Table: The Distance Matrix D_m

28 A minimal spanning tree MST(D_m): edges e(S_1, S_2) = 2, e(S_2, S_4) = 2, e(S_1, S_3) = 3.
Theorem: MST(D) is equal to MST(D_m).
Corollary: Let e(a, b) and e(c, d) be two edges on MST(D). If D(a, b) < D(c, d), then D_m(a, b) < D_m(c, d).
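The entries of D_m are not in this transcript, but they can be recomputed from the final alignment on slide 27. The sketch below assumes the unit-cost induced distance (a column counts 1 when its two symbols differ and 0 otherwise, including a space over a space) and builds a minimal spanning tree with Kruskal's algorithm; the names `induced_distance` and `kruskal` are illustrative, and the tie-breaking by sequence name is an arbitrary choice.

```python
from itertools import combinations


def induced_distance(row1: str, row2: str) -> int:
    # Assumed unit cost: 1 for a mismatch or a residue facing a space,
    # 0 for equal symbols (so '-' over '-' costs nothing).
    return sum(a != b for a, b in zip(row1, row2))


def kruskal(dist: dict) -> list:
    """dist maps (u, v) name pairs to weights; returns the tree edges."""
    parent = {name: name for edge in dist for name in edge}

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    tree = []
    for u, v in sorted(dist, key=lambda e: (dist[e], e)):
        ru, rv = find(u), find(v)
        if ru != rv:  # the edge joins two components, so no cycle is created
            parent[ru] = rv
            tree.append((u, v, dist[(u, v)]))
    return tree


# Final alignment from slide 27.
rows = {"S1": "ATG-C-TC", "S2": "ATG-A-GC", "S3": "TT--C-TG", "S4": "ATGCATGC"}
d_m = {(u, v): induced_distance(rows[u], rows[v])
       for u, v in combinations(sorted(rows), 2)}
tree = kruskal(d_m)
print(d_m)
print(tree)  # [('S1', 'S2', 2), ('S2', 'S4', 2), ('S1', 'S3', 3)]
```

The three tree edges keep the distances they have in D (2, 2 and 3), while the non-tree pairs grow, for example D_m(S_1, S_4) = 4 against D(S_1, S_4) = 2. Note that D itself has three edges of weight 2 among S_1, S_2 and S_4, so its minimal spanning tree is not unique; the sketch therefore builds the tree only for D_m.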

