Download presentation

Presentation is loading. Please wait.

Published byLeticia Woolcott Modified over 3 years ago

1
Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

2
Sequence comparison overview Problem: Find the “best” alignment between a query sequence and a target sequence. To solve this problem, we need –a method for scoring alignments, and –an algorithm for finding the alignment with the best score. The alignment score is calculated using –a substitution matrix –gap penalties. The algorithm for finding the best alignment is dynamic programming (DP).

3
A simple alignment problem. Problem: find the best pairwise alignment of GAATC and CATAC. Use a linear gap penalty of -4. Use the following substitution matrix: ACGT A10-5 0 C 10-5 0 G 0 10-5 T 0 10

4
How many possibilities? How many different possible alignments of two sequences of length n exist? GAATC CATAC GAATC- CA-TAC GAAT-C C-ATAC GAAT-C CA-TAC -GAAT-C C-A-TAC GA-ATC CATA-C

5
How many possibilities? How many different alignments of two sequences of length n exist? GAATC CATAC GAATC- CA-TAC GAAT-C C-ATAC GAAT-C CA-TAC -GAAT-C C-A-TAC GA-ATC CATA-C 5 2.5x10 2 10 1.8x10 5 20 1.4x10 11 30 1.2x10 17 40 1.1x10 23 FYI for two sequences of length m and n, possible alignments number:

6
DP matrix GAATC C A T A C ACGT A10-5 0 C 10-5 0 G 0 10-5 T 0 10 i 012345012345 j 0 1 2 3 etc. 5 The value in position ( i,j ) is the score of the best alignment of the first i positions of one sequence versus the first j positions of the other sequence. GA CA init. row and column

7
DP matrix GAATC C A 5 1 T A C Moving horizontally in the matrix introduces a gap in the sequence along the left edge. GAA CA- ACGT A10-5 0 C 10-5 0 G 0 10-5 T 0 10

8
DP matrix GAATC C A 5 T 1 A C Moving vertically in the matrix introduces a gap in the sequence along the top edge. GA- CAT ACGT A10-5 0 C 10-5 0 G 0 10-5 T 0 10

9
DP matrix GAATC C A 5 T 0 A C Moving diagonally in the matrix aligns two residues GAA CAT ACGT A10-5 0 C 10-5 0 G 0 10-5 T 0 10

10
Initialization GAATC 0 C A T A C ACGT A10-5 0 C 10-5 0 G 0 10-5 T 0 10 Start at top left and move progressively

11
Introducing a gap GAATC 0-4 C A T A C G-G- ACGT A10-5 0 C 10-5 0 G 0 10-5 T 0 10

12
GAATC 0-4 C A T A C -C-C ACGT A10-5 0 C 10-5 0 G 0 10-5 T 0 10 Introducing a gap

13
Complete first row and column GAATC 0-4-8-12-16-20 C-4 A-8 T-12 A-16 C-20 ACGT A10-5 0 C 10-5 0 G 0 10-5 T 0 10 ----- CATAC

14
Three ways to get to i=1, j=1 GAATC 0-4 C-8 A T A C ACGT A10-5 0 C 10-5 0 G 0 10-5 T 0 10 G- -C j 0 1 2 3 etc. i 012345012345

15
GAATC 0 C-4-8 A T A C ACGT A10-5 0 C 10-5 0 G 0 10-5 T 0 10 -G C- Three ways to get to i=1, j=1 j 0 1 2 3 etc. i 012345012345

16
GAATC 0 C-5 A T A C GCGC ACGT A10-5 0 C 10-5 0 G 0 10-5 T 0 10 Three ways to get to i=1, j=1 i 012345012345 j 0 1 2 3 etc.

17
Accept the highest scoring of the three GAATC 0-4-8-12-16-20 C-4-5 A-8 T-12 A-16 C-20 ACGT A10-5 0 C 10-5 0 G 0 10-5 T 0 10 Then simply repeat the same rule progressively across the matrix

18
DP matrix GAATC 0-4-8-12-16-20 C-4-5 A-8? T-12 A-16 C-20 ACGT A10-5 0 C 10-5 0 G 0 10-5 T 0 10

19
DP matrix GAATC 0-4-8-12-16-20 C-4-5 A-8 ? T-12 A-16 C-20 -4 0 -G CA G- CA --G CA- -4 -9-12 ACGT A10-5 0 C 10-5 0 G 0 10-5 T 0 10

20
DP matrix GAATC 0-4-8-12-16-20 C-4-5 A-8 -4 T-12 A-16 C-20 -4 0 -G CA G- CA --G CA- -4 -9-12 ACGT A10-5 0 C 10-5 0 G 0 10-5 T 0 10

21
DP matrix GAATC 0-4-8-12-16-20 C-4-5 A-8-4 T-12? A-16? C-20? ACGT A10-5 0 C 10-5 0 G 0 10-5 T 0 10

22
DP matrix GAATC 0-4-8-12-16-20 C-4-5 A-8-4 T-12-8 A-16-12 C-20-16 ACGT A10-5 0 C 10-5 0 G 0 10-5 T 0 10

23
DP matrix GAATC 0-4-8-12-16-20 C-4-5? A-8-4? T-12-8? A-16-12? C-20-16? ACGT A10-5 0 C 10-5 0 G 0 10-5 T 0 10

24
DP matrix GAATC 0-4-8-12-16-20 C-4-5-9 A-8-45 T-12-81 A-16-122 C-20-16-2 What is the alignment associated with this entry? ACGT A10-5 0 C 10-5 0 G 0 10-5 T 0 10 (just follow the arrows back - this is called the traceback)

25
DP matrix GAATC 0-4-8-12-16-20 C-4-5-9 A-8-45 T-12-81 A-16-122 C-20-16-2 ACGT A10-5 0 C 10-5 0 G 0 10-5 T 0 10 -G-A CATA

26
GAATC 0-4-8-12-16-20 C-4-5-9 A-8-45 T-12-81 A-16-122 C-20-16-2? ACGT A10-5 0 C 10-5 0 G 0 10-5 T 0 10 Continue and find the optimal global alignment, and its score. DP matrix

27
GAATC 0-4-8-12-16-20 C-4-5-9-13-12-6 A-8-451-3-7 T-12-810117 A-16-1221176 C-20-16-271117 ACGT A10-5 0 C 10-5 0 G 0 10-5 T 0 10 Best alignment starts at lower right and follows traceback arrows to top left

28
GAATC 0-4-8-12-16-20 C-4-5-9-13-12-6 A-8-451-3-7 T-12-810117 A-16-1221176 C-20-16-271117 ACGT A10-5 0 C 10-5 0 G 0 10-5 T 0 10 GA-ATC CATA-C One best traceback

29
GAATC 0-4-8-12-16-20 C-4-5-9-13-12-6 A-8-451-3-7 T-12-810117 A-16-1221176 C-20-16-271117 ACGT A10-5 0 C 10-5 0 G 0 10-5 T 0 10 GAAT-C -CATAC Another best traceback

30
GAATC 0-4-8-12-16-20 C-4-5-9-13-12-6 A-8-451-3-7 T-12-810117 A-16-1221176 C-20-16-271117 ACGT A10-5 0 C 10-5 0 G 0 10-5 T 0 10 GA-ATC CATA-C GAAT-C -CATAC

31
Multiple solutions When a program returns a single sequence alignment, it may not be the only best alignment but it is guaranteed to be one of them. In our example, all of the alignments at the left have equal scores. GA-ATC CATA-C GAAT-C CA-TAC GAAT-C C-ATAC GAAT-C -CATAC

32
DP in equation form Align sequence x and y. F is the DP matrix; s is the substitution matrix; d is the linear gap penalty.

33
DP equation graphically take the max of these three

34
Dynamic programming Yes, it’s a weird name. DP is closely related to recursion and to mathematical induction. We can prove that the resulting score is optimal.

35
Summary Scoring a pairwise alignment requires a substitution matrix and gap penalties. Dynamic programming is an efficient algorithm for finding an optimal alignment. Entry ( i,j ) in the DP matrix stores the score of the best-scoring alignment up to that position. DP iteratively fills in the matrix using a simple mathematical rule.

36
ACGT A10-50 C 10-50 G0 10-5 T 0 10 GAATC 0 A A T T C Practice problem: find a best pairwise alignment of GAATC and AATTC d = -4

Similar presentations

OK

Gapped BLAST and PSI- BLAST: a new generation of protein database search programs By Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui.

Gapped BLAST and PSI- BLAST: a new generation of protein database search programs By Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui.

© 2017 SlidePlayer.com Inc.

All rights reserved.

Ads by Google

Ppt on intelligent manufacturing system Ppt on basel 3 Download ppt on human eye and colourful world Ppt on use of body language in communication Download ppt on chromosomes Ppt on pizza hut india Ppt on social etiquettes of israel Ppt on obesity prevention center Ppt on environmental protection Ppt on australian continent facts