Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble.


One-minute responses
The pace was good until the end. One or two more minutes for each sample problem would have been good.
The pace worked well for me. I scrambled a bit to keep up on the computer but was able to do so.
I feel like a fish out of water, but the pace of the class was good. With this kind of stuff it’s really easy to feel stupid if you haven’t done it before, so I like the relaxed intro feeling.
I thought the pace and amount of material was fine. Thought it covered the basics, didn’t move too fast.
I am very excited about this class.
I enjoyed the slow and steady pacing because I am a beginner in programming.
The sample problems are excellent. I wish there was time for more.
It seemed that the class was a bit too fast. It was hard to keep up. I was not sure what I was doing! Perhaps prior reading would help.
Thanks for moving so slowly and being so patient. Please keep it up.
I have no background in programming, and though I recognize that today’s class was very basic, I struggled some. I am not sure how I would have had an easier time; perhaps if more time had been taken explaining how each component (import, math, commas, etc.) had impact on function.

One-minute responses
It is still unclear to me why the “float” prefix is necessary.
  –Because otherwise Python will treat the argument as a string rather than a number.
I tried importing math and then using math.sum(arg1, arg2) because it seemed logical after the log example.
  –This was an error on my part: I should have introduced the “+” operator before that sample problem.
Is what we’re referring to as an “argument” anything that follows the program name on the command line?
  –Yes.
Is that generally what is meant by an argument, or does it have a broader meaning?
  –The broader meaning is anything that is given as input to a function; e.g., 10 is the argument to math.log(10). In general, Python functions take arguments in parentheses. (In Python 2, the print statement is an exception to this rule; in Python 3, print is an ordinary function.)
I like that the instructors walk around to check on things.
It might help to explain where we can look up packages to do various functions.
  –Your book describes the most common packages. Otherwise, you can search at python.org.
I think some simple terms could be defined like “terminal,” “system,” “argument.”
The programming exercises were useful, but if I didn’t already have a basic knowledge of Perl I would be confused by the terminology (operators, variables, objects, working directory, etc.).
This was my first time programming, so I was confused by terminology. After seeing the examples, I started to understand it better, though.
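The answer about float can be illustrated with a small sketch. It is written in Python 3 syntax, and the function name add_arguments and the simulated argument list are made up for this example; in a real program the list would come from sys.argv.

```python
import math


def add_arguments(args):
    """Add two command-line arguments, which always arrive as strings."""
    # float() converts each string to a number before the "+" is applied.
    return float(args[0]) + float(args[1])


# Simulates running "python add.py 2 3"; element 0 is the program name.
simulated_argv = ["add.py", "2", "3"]

print(simulated_argv[1] + simulated_argv[2])  # strings concatenate: 23
print(add_arguments(simulated_argv[1:]))      # numbers add: 5.0
print(math.log(10))                           # 10 is the argument to math.log
```

Without the conversion, "+" joins the two strings instead of adding the two numbers.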

One-minute responses
One thing I may like is if we could have an appendix for the commands we learned that day. I imagine maybe the book has it?
  –All of the commands you learn are in the book. I encourage each of you to maintain a summary of what we’ve learned thus far.
Could you give supplementary problems to work on?
  –Yes, I will try to do this.
It was helpful having a printed copy of the lab task.
Can you print quotation marks? Are there other symbols that are off-limits to print?
  –You can print anything. Unusual characters need to be preceded by a backslash: print "\""
It was a little unclear at first what was the relevance of the sys.argv command.
I probably missed it but the only thing I could think of was more advanced warning on the text. It was hard to follow directions without having read for myself to put into proper context.
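A short sketch of the backslash rule from the answer above, in Python 3 syntax (print takes parentheses there). The variable name and sample strings are made up for this example.

```python
# A backslash lets a double quote appear inside a double-quoted string.
quote = "\""
print(quote)

# A single-quoted string can hold a double quote without any backslash.
print('"')

# The backslash itself is printed by doubling it.
print("She said \"hi\" and typed a backslash: \\")
```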

Sequence comparison overview
Problem: Find the “best” alignment between a query sequence and a target sequence.
To solve this problem, we need
  –___________________, and
  –___________________.
The alignment score is calculated using
  –___________________, and
  –___________________.
The algorithm for finding the best alignment is __________________.

Sequence comparison overview
Problem: Find the “best” alignment between a query sequence and a target sequence.
To solve this problem, we need
  –a method for scoring alignments, and
  –an algorithm for finding the alignment with the best score.
The alignment score is calculated using
  –a substitution matrix, and
  –gap penalties.
The algorithm for finding the best alignment is dynamic programming.

Review
What does the 62 in BLOSUM62 mean?
Why do leucine and isoleucine get a BLOSUM62 score of 2, whereas leucine and aspartic acid get a score of -4?
What is the difference between a gap open and a gap extension penalty?

Review
What does the 62 in BLOSUM62 mean?
  –The alignments used to generate the matrix were clustered at 62% pairwise sequence identity: sequences at least that similar were grouped and counted as one.
Why do leucine and isoleucine get a BLOSUM62 score of 2, whereas leucine and aspartic acid get a score of -4?
  –Leucine and isoleucine are biochemically very similar; consequently, a substitution of one for the other is much more likely to occur.
What is the difference between a gap open and a gap extension penalty?
  –When we assign a score to an observed gap in an alignment, we charge a larger penalty for the first gapped position than for subsequent positions in the same gap.
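The gap open versus gap extension distinction can be sketched in a few lines. The penalty values (-11 to open, -1 to extend) and the function name gap_score are illustrative assumptions, as is the convention that the first gapped position is charged the open penalty and each later position the extension penalty; other conventions exist.

```python
def gap_score(length, gap_open=-11, gap_extend=-1):
    """Score one gap of `length` positions under an open/extend scheme."""
    if length <= 0:
        return 0
    # The first gapped position costs gap_open; each later one costs gap_extend.
    return gap_open + (length - 1) * gap_extend


print(gap_score(1))      # a single gapped position: -11
print(gap_score(3))      # one gap of three positions: -13
print(3 * gap_score(1))  # three separate one-position gaps: -33
```

Under these assumed values, one long gap is much cheaper than several short ones, which is the behavior the two penalties are meant to produce.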

A simple alignment problem.
Problem: find the best pairwise alignment of GAATC and CATAC.
Use a linear gap penalty of -4.
Use the following substitution matrix:
        A    C    G    T
   A   10   -5    0   -5
   C   -5   10   -5    0
   G    0   -5   10   -5
   T   -5    0   -5   10
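As a rough illustration of how a substitution matrix and a linear gap penalty combine into an alignment score, the sketch below scores an alignment that is already given. It assumes the matrix reads 10 for a match, 0 for A/G and C/T, and -5 for every other mismatch; the names S, GAP, and score_alignment are made up for this example.

```python
BASES = "ACGT"
ROWS = [
    [10, -5, 0, -5],   # A versus A, C, G, T
    [-5, 10, -5, 0],   # C
    [0, -5, 10, -5],   # G
    [-5, 0, -5, 10],   # T
]
# Assumed substitution matrix, stored as a dictionary keyed by letter pairs.
S = {(a, b): ROWS[i][j] for i, a in enumerate(BASES) for j, b in enumerate(BASES)}
GAP = -4  # linear gap penalty: every gapped position costs the same


def score_alignment(top, bottom):
    """Sum the column scores of two equal-length gapped strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        total += GAP if "-" in (a, b) else S[(a, b)]
    return total


print(score_alignment("GAATC", "CATAC"))    # no gaps: 5
print(score_alignment("GAAT-C", "CA-TAC"))  # two gaps: 17
```

Scoring one alignment is easy; the hard part is finding the best one among all possible alignments.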

How many possibilities?
How many different alignments of two sequences of length n exist?
  GAATC    GAATC-    GAAT-C    GAAT-C    -GAAT-C    GA-ATC
  CATAC    CA-TAC    C-ATAC    CA-TAC    C-A-TAC    CATA-C
Too many to enumerate!
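To get a feel for "too many to enumerate", the sketch below counts alignments with a small recurrence. It assumes that two alignments are different whenever their columns differ and that no column may hold two gaps; under that assumption the last column is either a letter pair, a letter over a gap, or a gap over a letter, which gives three smaller counting problems. The function name count_alignments is made up for this example.

```python
def count_alignments(n, m):
    """Count gapped alignments of sequences of lengths n and m."""
    # counts[i][j] = number of alignments of an i-prefix with a j-prefix.
    # An empty prefix can only be aligned against gaps, in exactly one way.
    counts = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            counts[i][j] = (
                counts[i - 1][j - 1]  # last column pairs two letters
                + counts[i - 1][j]    # last column: letter over gap
                + counts[i][j - 1]    # last column: gap over letter
            )
    return counts[n][m]


for n in (1, 2, 5, 10, 20):
    print(n, count_alignments(n, n))
```

Even for the five-letter example there are 1683 alignments under this counting rule, and the count grows exponentially with sequence length.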

DP matrix
GAATC runs along the top edge of the matrix and CATAC along the left edge.
The value in position (i,j) is the score of the best alignment of the first i positions of the first sequence versus the first j positions of the second sequence.
Example: the entry -8 corresponds to the alignment
  -G-
  CAT

DP matrix
Moving horizontally in the matrix introduces a gap in the sequence along the left edge.
  -G-A
  CAT-

DP matrix
Moving vertically in the matrix introduces a gap in the sequence along the top edge. Example: moving down from -8 (row T) to -12 (row A) gives
  -G--
  CATA

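Initialization
GAATC runs along the top edge and CATAC along the left edge. The top-left entry is 0.

Introducing a gap
Moving one step to the right from 0 aligns G with a gap, so the entry is 0 - 4 = -4.
  G
  -

DP matrix
Moving one step down from 0 aligns a gap with C, so that entry is also -4.
  -
  C

DP matrix
Reaching the entry for G versus C through either -4 adds a second gap and scores -4 - 4 = -8.

DP matrix
Reaching the same entry along the diagonal aligns G with C and scores 0 - 5 = -5. This is better than -8, so the entry is -5.
  G
  C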

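DP matrix
The first column aligns prefixes of CATAC against gaps only.
           G    A    A    T    C
      0   -4   -8  -12  -16  -20
  C  -4   -5
  A  -8
  T -12
  A -16
  C -20
  -----
  CATAC

DP matrix
           G
      0   -4
  C  -4   -5
  A  -8    ?
  T -12
  A -16
  C -20

DP matrix
           G
      0   -4
  C  -4   -5
  A  -8   -4
  T -12
  A -16
  C -20
The three candidate alignments for this entry score -4, -9, and -12; the best is kept.
  -G    G-    --G
  CA    CA    CA-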
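The choice among three ways of reaching one entry can be written out directly. This sketch recomputes the entry for G versus CA from its three neighbors, assuming a substitution score of 0 for A against G and the gap penalty of -4; the function name cell_candidates is made up for this example.

```python
def cell_candidates(diagonal, above, left, substitution, gap=-4):
    """Return the three candidate scores for one DP entry."""
    return {
        "diagonal": diagonal + substitution,  # align the two letters
        "vertical": above + gap,              # gap in the top sequence
        "horizontal": left + gap,             # gap in the left sequence
    }


# Neighbors of the entry in row A, column G: -4 diagonally, -5 above, -8 left.
candidates = cell_candidates(diagonal=-4, above=-5, left=-8, substitution=0)
print(candidates)
print(max(candidates.values()))  # the entry keeps the best candidate
```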

DP matrix
           G
      0   -4
  C  -4   -5
  A  -8   -4
  T -12    ?
  A -16    ?
  C -20    ?

DP matrix
           G
      0   -4
  C  -4   -5
  A  -8   -4
  T -12   -8
  A -16  -12
  C -20  -16

DP matrix
           G    A
      0   -4   -8
  C  -4   -5    ?
  A  -8   -4    ?
  T -12   -8    ?
  A -16  -12    ?
  C -20  -16    ?
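The remaining columns follow the same rule, so the whole matrix can be filled with two nested loops. The sketch below is one way to do it, assuming the substitution matrix reads 10 for a match, 0 for A/G and C/T, and -5 otherwise; the names S, GAP, and fill_matrix are made up for this example. Rows correspond to the sequence on the left edge and columns to the sequence on the top edge.

```python
BASES = "ACGT"
ROWS = [[10, -5, 0, -5], [-5, 10, -5, 0], [0, -5, 10, -5], [-5, 0, -5, 10]]
# Assumed substitution matrix, keyed by letter pairs.
S = {(a, b): ROWS[i][j] for i, a in enumerate(BASES) for j, b in enumerate(BASES)}
GAP = -4  # linear gap penalty


def fill_matrix(top, left):
    """Fill the DP matrix for global alignment of `top` and `left`."""
    F = [[0] * (len(top) + 1) for _ in range(len(left) + 1)]
    # First row and first column: a prefix aligned against gaps only.
    for j in range(1, len(top) + 1):
        F[0][j] = F[0][j - 1] + GAP
    for i in range(1, len(left) + 1):
        F[i][0] = F[i - 1][0] + GAP
    # Every other entry is the best of its three neighbors.
    for i in range(1, len(left) + 1):
        for j in range(1, len(top) + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + S[(left[i - 1], top[j - 1])],  # diagonal
                F[i - 1][j] + GAP,                               # vertical
                F[i][j - 1] + GAP,                               # horizontal
            )
    return F


matrix = fill_matrix("GAATC", "CATAC")
for row in matrix:
    print(" ".join(f"{value:4d}" for value in row))
```

The bottom-right entry of the printed matrix is the score of the best alignment of the two full sequences.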

DP matrix
           G    A
  A  -8   -4    5
What is the alignment associated with this entry?

DP matrix
           G    A
  A  -8   -4    5
G-A CATA

DP matrix
           G    A
  A  -8   -4    5
Find the optimal alignment, and its score.

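DP matrix
The completed matrix; the bottom-right entry, 17, is the score of the best alignment.
           G    A    A    T    C
      0   -4   -8  -12  -16  -20
  C  -4   -5   -9  -13  -12   -6
  A  -8   -4    5    1   -3   -7
  T -12   -8    1    0   11    7
  A -16  -12    2   11    7    6
  C -20  -16   -2    7   11   17
Tracing back from the bottom-right entry recovers four alignments with this score:
  GA-ATC    GAAT-C    GAAT-C    GAAT-C
  CATA-C    CA-TAC    C-ATAC    -CATAC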

Multiple solutions
When a program returns a sequence alignment, it may not be the only best alignment.
  GA-ATC    GAAT-C    GAAT-C    GAAT-C
  CATA-C    CA-TAC    C-ATAC    -CATAC
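One way to see all the best alignments is to follow every tied path back through the filled matrix rather than stopping at the first. The sketch below does this for the example problem, assuming the substitution matrix reads 10 for a match, 0 for A/G and C/T, and -5 otherwise; the names S, GAP, fill_matrix, and all_best_alignments are made up for this example.

```python
BASES = "ACGT"
ROWS = [[10, -5, 0, -5], [-5, 10, -5, 0], [0, -5, 10, -5], [-5, 0, -5, 10]]
S = {(a, b): ROWS[i][j] for i, a in enumerate(BASES) for j, b in enumerate(BASES)}
GAP = -4


def fill_matrix(top, left):
    """Fill the DP matrix; rows follow `left`, columns follow `top`."""
    F = [[0] * (len(top) + 1) for _ in range(len(left) + 1)]
    for j in range(1, len(top) + 1):
        F[0][j] = F[0][j - 1] + GAP
    for i in range(1, len(left) + 1):
        F[i][0] = F[i - 1][0] + GAP
        for j in range(1, len(top) + 1):
            F[i][j] = max(F[i - 1][j - 1] + S[(left[i - 1], top[j - 1])],
                          F[i - 1][j] + GAP,
                          F[i][j - 1] + GAP)
    return F


def all_best_alignments(top, left):
    """Return the best score and every alignment that achieves it."""
    F = fill_matrix(top, left)
    found = []

    def walk(i, j, top_row, bottom_row):
        if i == 0 and j == 0:
            found.append((top_row, bottom_row))
            return
        # Follow each move that could have produced F[i][j].
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + S[(left[i - 1], top[j - 1])]:
            walk(i - 1, j - 1, top[j - 1] + top_row, left[i - 1] + bottom_row)
        if i > 0 and F[i][j] == F[i - 1][j] + GAP:
            walk(i - 1, j, "-" + top_row, left[i - 1] + bottom_row)
        if j > 0 and F[i][j] == F[i][j - 1] + GAP:
            walk(i, j - 1, top[j - 1] + top_row, "-" + bottom_row)

    walk(len(left), len(top), "", "")
    return F[len(left)][len(top)], found


best_score, alignments = all_best_alignments("GAATC", "CATAC")
print(best_score)
for top_row, bottom_row in alignments:
    print(top_row, bottom_row)
```

A program that reports only one alignment is making an arbitrary choice among such ties, typically based on the order in which it checks the three moves.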

DP in equation form
Align sequences x and y.
F is the DP matrix; s is the substitution matrix; d is the linear gap penalty.

DP in equation form
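With the symbols defined above, the fill rule used in the worked example can be written as the following recurrence. This is a sketch under two assumptions: d is a negative number that is added for each gapped position (d = -4 in the example), and x_i and y_j denote the i-th letter of x and the j-th letter of y.

```latex
\begin{align*}
F(0,0) &= 0, \qquad F(i,0) = i\,d, \qquad F(0,j) = j\,d \\
F(i,j) &= \max
\begin{cases}
F(i-1,\,j-1) + s(x_i, y_j) & \text{align } x_i \text{ with } y_j \\
F(i-1,\,j) + d & \text{align } x_i \text{ with a gap} \\
F(i,\,j-1) + d & \text{align } y_j \text{ with a gap}
\end{cases}
\end{align*}
```

For sequences of lengths n and m, the score of the best alignment is F(n,m).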

Dynamic programming Yes, it’s a weird name. DP is closely related to recursion and to mathematical induction. We can prove that the resulting score is optimal.
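The link to recursion can be made concrete: the same score can be computed by a recursive function that remembers the answers it has already worked out. The sketch below is one such top-down version, assuming the example's substitution matrix (10 for a match, 0 for A/G and C/T, -5 otherwise) and gap penalty of -4; the names S, GAP, and best_score are made up for this example.

```python
from functools import lru_cache

BASES = "ACGT"
ROWS = [[10, -5, 0, -5], [-5, 10, -5, 0], [0, -5, 10, -5], [-5, 0, -5, 10]]
S = {(a, b): ROWS[i][j] for i, a in enumerate(BASES) for j, b in enumerate(BASES)}
GAP = -4


def best_score(x, y):
    """Best global alignment score of x and y, computed recursively."""

    @lru_cache(maxsize=None)  # remember each (i, j) so it is solved only once
    def F(i, j):
        # Base case: two empty prefixes align with score 0.
        if i == 0 and j == 0:
            return 0
        options = []
        if i > 0 and j > 0:
            options.append(F(i - 1, j - 1) + S[(x[i - 1], y[j - 1])])
        if i > 0:
            options.append(F(i - 1, j) + GAP)  # x[i-1] against a gap
        if j > 0:
            options.append(F(i, j - 1) + GAP)  # y[j-1] against a gap
        return max(options)

    return F(len(x), len(y))


print(best_score("GAATC", "CATAC"))  # 17
```

The inductive argument has the same shape: if every smaller entry holds an optimal score, then taking the best of the three extensions gives an optimal score for the current entry.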

Summary
Scoring a pairwise alignment requires a substitution matrix and gap penalties.
Dynamic programming is an efficient algorithm for finding the optimal alignment.
Entry (i,j) in the DP matrix stores the score of the best-scoring alignment up to those positions.
DP iteratively fills in the matrix using a simple mathematical rule.

Reading Nicholas review from Biotechniques, 2000.

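Problem: find the best pairwise alignment of GAATC and CATAC.
d = -4
        A    C    G    T
   A   10   -5    0   -5
   C   -5   10   -5    0
   G    0   -5   10   -5
   T   -5    0   -5   10
DP matrix: GAATC along the top edge, CATAC along the left edge, 0 in the top-left entry.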