Presentation is loading. Please wait.

Presentation is loading. Please wait.

Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble.

Similar presentations


Presentation on theme: "Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble."— Presentation transcript:

1 Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble

2 One-minute responses The pace was good until the end. One or two more minutes for each sample problem would have been good. The pace worked well for me. I scrambled a bit to keep up on the computer but was able to do so. I feel like a fish out of water, but the pace of the class was good. With this kind of stuff it’s really easy to feel stupid if you haven’t done it before, so I like the relaxed intro feeling. I thought the pace and amount of material was fine. Thought it covered the basics, didn’t move too fast. I am very excited about this class. I enjoyed the slow and steady pacing because I am a beginner in programming. The sample problems are excellent. I wish there was time for more. It seemed that the class was a bit too fast. It was hard to keep up. I was not sure what I was doing! Perhaps prior reading would help. Thanks for moving so slowly and being so patient. Please keep it up. I have no background in programming, and though I recognize that today’s class was very basic, I struggled some. I am not sure how I would have had an easier time – perhaps if more time had been taken explaining how each component (import, math, commas, etc.) had impact on function.

3 One-minute responses It is still unclear to me why the “float” prefix is necessary. –Because otherwise Python will define the argument as a string, rather than a number. I tried importing math and then using math.sum(arg1, arg2) because it seemed logical after the log example. –This was an error on my part – I should have introduced the “+” operator before that sample problem. Is what we’re referring to as an “argument” anything that follows the program name on the command line? –Yes. Is that generally what is meant by an argument, or does it have a broader meaning? –The broader meaning is anything that is given as input to a function; e.g., 10 is the argument to math.log(10). In general, Python functions take arguments in parentheses. (The print command is an exception to this rule.) I like that the instructors walk around to check on things. It might help to explain where we can look up packages to do various functions. –Your book describes the most common packages. Otherwise, you can search at python.org. I think some simple terms could be defined like “terminal,” “system,” “argument.” The programming exercises were useful, but if I didn’t already have a basic knowledge of Perl I would be confused by the terminology (operators, variables, objects, working directory, etc.). This was my first time programming, so I was confused by terminology. After seeing the examples, I started to understand it better, though.

4 One-minute responses One thing I may like is if we could have an appendix for the commands we learned that day. I imagine maybe the book has it? –All of the commands you learn are in the book. I encourage each of you to maintain a summary of what we’ve learned thus far. Could you give supplementary problems to work on? –Yes, I will try to do this. It was helpful having a printed copy of the lab task. Can you print quotation marks? Are there other symbols that are off- bounds to print? –You can print anything. Unusual characters need to be preceded by a backslash: print "\"" It was a little unclear at first what was the relevance of the sys.argv command. I probably missed it but the only thing I could think of was more advanced warning on the text. It was hard to follow directions without having read for myself to put into proper context.

5 Sequence comparison overview Problem: Find the “best” alignment between a query sequence and a target sequence. To solve this problem, we need –___________________, and –___________________. The alignment score is calculated using –___________________, and –___________________. The algorithm for finding the best alignment is __________________.

6 Sequence comparison overview Problem: Find the “best” alignment between a query sequence and a target sequence. To solve this problem, we need –a method for scoring alignments, and –an algorithm for finding the alignment with the best score. The alignment score is calculated using –a substitution matrix, and –gap penalties. The algorithm for finding the best alignment is dynamic programming.

7 Review What does the 62 in BLOSUM62 mean? Why does leucine and isoleucine get a BLOSUM62 score of 2, whereas leucine and aspartic acid get a score of -4? What is the difference between a gap open and a gap extension penalty?

8 Review What does the 62 in BLOSUM62 mean? –The sequences in the alignment used to generate the matrix share 62% pairwise sequence identity. Why does leucine and isoleucine get a BLOSUM62 score of 2, whereas leucine and aspartic acid get a score of -4? –Leucine and isoleucine are biochemically very similar; consequently, a substitution of one for the other is much more likely to occur. What is the difference between a gap open and a gap extension penalty? –When we assign a score to an observed gap in an alignment, we charge a larger penalty for the first gapped position than for subsequent positions in the same gap.

9 A simple alignment problem. Problem: find the best pairwise alignment of GAATC and CATAC. Use a linear gap penalty of -4. Use the following substitution matrix: ACGT A10-50 C 10-50 G0 10-5 T 0 10

10 How many possibilities? How many different alignments of two sequences of length N exist? GAATC CATAC GAATC- CA-TAC GAAT-C C-ATAC GAAT-C CA-TAC -GAAT-C C-A-TAC GA-ATC CATA-C

11 How many possibilities? How many different alignments of two sequences of length n exist? GAATC CATAC GAATC- CA-TAC GAAT-C C-ATAC GAAT-C CA-TAC -GAAT-C C-A-TAC GA-ATC CATA-C Too many to enumerate!

12 DP matrix GAATC C A T A C -8 The value in position (i,j) is the score of the best alignment of the first i positions of the first sequence versus the first j positions of the second sequence. -G- CAT

13 DP matrix GAATC C A T -8 - 12 A C Moving horizontally in the matrix introduces a gap in the sequence along the left edge. -G-A CAT-

14 DP matrix GAATC C A T -8 A - 12 C Moving vertically in the matrix introduces a gap in the sequence along the top edge. -G-- CATA

15 Initialization GAATC 0 C A T A C

16 Introducing a gap GAATC 0-4 C A T A C G-G-

17 DP matrix GAATC 0-4 C A T A C -C-C

18 DP matrix GAATC 0-4 C -8 A T A C

19 DP matrix GAATC 0-4 C -5 A T A C GCGC

20 DP matrix GAATC 0-4-8-12-16-20 C-4-5 A-8 T-12 A-16 C-20 ----- CATAC

21 DP matrix GAATC 0-4-8-12-16-20 C-4-5 A-8? T-12 A-16 C-20

22 DP matrix GAATC 0-4-8-12-16-20 C-4-5 A-8-4 T-12 A-16 C-20 -4 0 -G CA G- CA --G CA- -4 -9-12

23 DP matrix GAATC 0-4-8-12-16-20 C-4-5 A-8-4 T-12? A-16? C-20?

24 DP matrix GAATC 0-4-8-12-16-20 C-4-5 A-8-4 T-12-8 A-16-12 C-20-16

25 DP matrix GAATC 0-4-8-12-16-20 C-4-5? A-8-4? T-12-8? A-16-12? C-20-16?

26 DP matrix GAATC 0-4-8-12-16-20 C-4-5-9 A-8-45 T-12-81 A-16-122 C-20-16-2 What is the alignment associated with this entry?

27 DP matrix GAATC 0-4-8-12-16-20 C-4-5-9 A-8-45 T-12-81 A-16-122 C-20-16-2 -G-A CATA

28 DP matrix GAATC 0-4-8-12-16-20 C-4-5-9 A-8-45 T-12-81 A-16-122 C-20-16-2? Find the optimal alignment, and its score.

29 DP matrix GAATC 0-4-8-12-16-20 C-4-5-9-13-12-6 A-8-451-3-7 T-12-810117 A-16-1221176 C-20-16-271117

30 DP matrix GAATC 0-4-8-12-16-20 C-4-5-9-13-12-6 A-8-451-3-7 T-12-810117 A-16-1221176 C-20-16-271117 GA-ATC CATA-C

31 DP matrix GAATC 0-4-8-12-16-20 C-4-5-9-13-12-6 A-8-451-3-7 T-12-810117 A-16-1221176 C-20-16-271117 GAAT-C CA-TAC

32 DP matrix GAATC 0-4-8-12-16-20 C-4-5-9-13-12-6 A-8-451-3-7 T-12-810117 A-16-1221176 C-20-16-271117 GAAT-C C-ATAC

33 DP matrix GAATC 0-4-8-12-16-20 C-4-5-9-13-12-6 A-8-451-3-7 T-12-810117 A-16-1221176 C-20-16-271117 GAAT-C -CATAC

34 Multiple solutions When a program returns a sequence alignment, it may not be the only best alignment. GA-ATC CATA-C GAAT-C CA-TAC GAAT-C C-ATAC GAAT-C -CATAC

35 DP in equation form Align sequence x and y. F is the DP matrix; s is the substitution matrix; d is the linear gap penalty.

36 DP in equation form

37 Dynamic programming Yes, it’s a weird name. DP is closely related to recursion and to mathematical induction. We can prove that the resulting score is optimal.

38 Summary Scoring a pairwise alignment requires a substition matrix and gap penalties. Dynamic programming is an efficient algorithm for finding the optimal alignment. Entry (i,j) in the DP matrix stores the score of the best-scoring alignment up to those positions. DP iteratively fills in the matrix using a simple mathematical rule.

39 Reading Nicholas review from Biotechniques, 2000.

40 ACGT A10-50 C 10-50 G0 10-5 T 0 10 GAATC 0 C A T A C Problem: find the best pairwise alignment of GAATC and CATAC. d = -4


Download ppt "Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble."

Similar presentations


Ads by Google