Presentation is loading. Please wait.

Presentation is loading. Please wait.

Sequence comparison: Dynamic programming

Similar presentations


Presentation on theme: "Sequence comparison: Dynamic programming"— Presentation transcript:

1 Sequence comparison: Dynamic programming
Genome 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble

2 Things people liked Liked the bioinformatics part.
Lecture was straightforward and clear. x2 I like walking through step by step examples. I like practice in class. Appreciate call for feedback. Appreciate that you addressed who should be in the class. I like learning about the algorithm behind sequence alignment. Great explanation on how scoring algorithm works. Helpful to watch you type things in the terminal. First part: I like that you didn’t just lecture. I liked that you spent a good amount of time discussing why/why not to take the class. Last problem was just challenging enough. Good intro – made me less nervous about the class. Good overview of what to expect and definitions. Defining objects was very useful for understanding Python. I like the two-fold approach to the class.

3 Pacing Python section went too fast. Demo section went a bit too fast.
I could have used more time on the sample sets. Pacing was good.

4 Other concerns Still confused on setup of Python.
I would like more time to figure out where my directories/console is (Windows). PPT font a bit small for people sitting in the back. A little intro to navigating command line/terminal would be useful. Perhaps pass out a shell cheat sheet. The class size seems too large. I was confused about how to run the various programs. Please provide more info about the curve and grading.

5 Outstanding questions
How transferrable are these concepts to other languages? What makes C++ and Java more difficult? When working in an interactive python session, are any commands stored to the environment, or is each session starting from scratch? For the substitution matrix, is there a window size? Does BLAST break alignments into windows?

6 Sequence comparison overview
Problem: Find the “best” alignment between a query sequence and a target sequence. To solve this problem, we need ___________________, and ___________________. The alignment score is calculated using The algorithm for finding the best alignment is __________________.

7 Sequence comparison overview
Problem: Find the “best” alignment between a query sequence and a target sequence. To solve this problem, we need a method for scoring alignments, and an algorithm for finding the alignment with the best score. The alignment score is calculated using a substitution matrix, and gap penalties. The algorithm for finding the best alignment is dynamic programming.

8 Review What does the 62 in BLOSUM62 mean?
Why does leucine and isoleucine get a BLOSUM62 score of 2, whereas leucine and aspartic acid get a score of -4? What is the difference between a gap open and a gap extension penalty?

9 Review What does the 62 in BLOSUM62 mean?
The sequences in the alignment used to generate the matrix share 62% pairwise sequence identity. Why does leucine and isoleucine get a BLOSUM62 score of 2, whereas leucine and aspartic acid get a score of -4? Leucine and isoleucine are biochemically very similar; consequently, a substitution of one for the other is much more likely to occur. What is the difference between a gap open and a gap extension penalty? When we assign a score to an observed gap in an alignment, we charge a larger penalty for the first gapped position than for subsequent positions in the same gap.

10 A simple alignment problem.
Problem: find the best pairwise alignment of GAATC and CATAC.

11 A simple alignment problem.
Problem: find the best pairwise alignment of GAATC and CATAC. Use a linear gap penalty of -4. Use the following substitution matrix: A C G T 10 -5

12 How many possibilities?
GAATC CATAC GAAT-C C-ATAC -GAAT-C C-A-TAC GAATC- CA-TAC GAAT-C CA-TAC GA-ATC CATA-C How many different alignments of two sequences of length n exist? Too many to enumerate!

13 DP matrix -G- CAT G A T C The value in position (i,j) is the score of the best alignment of the first i positions of the first sequence versus the first j positions of the second sequence. -8

14 DP matrix -8 G A T C -12 -G-A CAT-
Moving horizontally in the matrix introduces a gap in the sequence along the left edge.

15 DP matrix -8 G A T C -12 -G-- CATA
Moving vertically in the matrix introduces a gap in the sequence along the top edge.

16 Initialization G A T C

17 G - Introducing a gap G A T C -4

18 - C DP matrix G A T C -4

19 DP matrix G A T C -4 -8

20 Three legal moves A diagonal move aligns a character from the left sequence with a character from the top sequence. A vertical move introduces a gap in the sequence along the top edge. A horizontal move introduces a gap in the sequence along the left edge.

21 G C DP matrix G A T C -4 -5

22 ----- CATAC DP matrix G A T C -4 -8 -12 -16 -20 -5

23 DP matrix G A T C -4 -8 -12 -16 -20 -5 ?

24 DP matrix G A T C -4 -8 -12 -16 -20 -5 -G CA G- CA --G CA- -4 -9 -12
-4 -8 -12 -16 -20 -5 -4 -4

25 DP matrix G A T C -4 -8 -12 -16 -20 -5 ?

26 DP matrix G A T C -4 -8 -12 -16 -20 -5

27 DP matrix G A T C -4 -8 -12 -16 -20 -5 ?

28 What is the alignment associated with this entry?
DP matrix G A T C -4 -8 -12 -16 -20 -5 -9 5 1 2 -2 What is the alignment associated with this entry?

29 DP matrix G A T C -4 -8 -12 -16 -20 -5 -9 5 1 2 -2 -G-A CATA

30 Find the optimal alignment, and its score.
DP matrix G A T C -4 -8 -12 -16 -20 -5 -9 5 1 2 -2 ? Find the optimal alignment, and its score.

31 DP matrix G A T C -4 -8 -12 -16 -20 -5 -9 -13 -6 5 1 -3 -7 11 7 2 6 -2
-4 -8 -12 -16 -20 -5 -9 -13 -6 5 1 -3 -7 11 7 2 6 -2 17

32 DP matrix GA-ATC CATA-C G A T C -4 -8 -12 -16 -20 -5 -9 -13 -6 5 1 -3
-4 -8 -12 -16 -20 -5 -9 -13 -6 5 1 -3 -7 11 7 2 6 -2 17

33 DP matrix GAAT-C CA-TAC G A T C -4 -8 -12 -16 -20 -5 -9 -13 -6 5 1 -3
-4 -8 -12 -16 -20 -5 -9 -13 -6 5 1 -3 -7 11 7 2 6 -2 17

34 DP matrix GAAT-C C-ATAC G A T C -4 -8 -12 -16 -20 -5 -9 -13 -6 5 1 -3
-4 -8 -12 -16 -20 -5 -9 -13 -6 5 1 -3 -7 11 7 2 6 -2 17

35 DP matrix GAAT-C -CATAC G A T C -4 -8 -12 -16 -20 -5 -9 -13 -6 5 1 -3
-4 -8 -12 -16 -20 -5 -9 -13 -6 5 1 -3 -7 11 7 2 6 -2 17

36 Multiple solutions GA-ATC CATA-C When a program returns a sequence alignment, it may not be the only best alignment. GAAT-C CA-TAC GAAT-C C-ATAC GAAT-C -CATAC

37 DP in equation form Align sequence x and y.
F is the DP matrix; s is the substitution matrix; d is the linear gap penalty.

38 DP in equation form

39 Dynamic programming Yes, it’s a weird name.
DP is closely related to recursion and to mathematical induction. We can prove that the resulting score is optimal.

40 Summary Dynamic programming is an efficient algorithm for finding the optimal alignment. Entry (i,j) in the DP matrix stores the score of the best-scoring alignment up to those positions. DP iteratively fills in the matrix using a simple mathematical rule.

41 G A T C Problem: find the best pairwise alignment of GAATC and CATAC.
10 -5 G A T C d = -4


Download ppt "Sequence comparison: Dynamic programming"

Similar presentations


Ads by Google