Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.

Presentation transcript:

Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha

DNA Sequence Comparison: First Success Story
Finding sequence similarities with genes of known function is a common approach to inferring a newly sequenced gene's function. In 1984 Russell Doolittle and colleagues found similarities between a cancer-causing gene and the normal growth factor (PDGF) gene. A normal growth gene switched on at the wrong time causes cancer!

Cystic Fibrosis
Cystic fibrosis (CF) is a chronic and frequently fatal genetic disease of the body's mucus glands. CF primarily affects the respiratory systems in children. The search for the CF gene was narrowed to ~1 Mbp, and the region was sequenced. The sequence was then scanned against a database for matches to known genes. A segment in this region matched the gene for some ATP binding protein(s). These proteins are part of the ion transport channel, and CF involves sweat secretions with abnormal sodium content!

Role for Bioinformatics
Similarities between a gene of known function and a gene of unknown function alert biologists to some possibilities. Computing a similarity score between two genes indicates how likely it is that they have similar functions. Dynamic programming is a technique for revealing similarities between genes.

Motivating Dynamic Programming

Dynamic programming example: Manhattan Tourist Problem
Imagine seeking a path from source to sink, traveling only eastward and southward, that passes the greatest number of attractions (*) in the Manhattan grid.
[Figure: Manhattan street grid with attractions marked *, and the source and sink corners labeled.]

Manhattan Tourist Problem: Formulation
Goal: Find the longest path in a weighted grid.
Input: A weighted grid G with two distinct vertices, one labeled “source” and the other labeled “sink”.
Output: A longest path in G from “source” to “sink”.

MTP: An Example
[Figure: weighted grid with axes labeled "i coordinate" and "j coordinate", and the source and sink marked.]

MTP: Greedy Algorithm Is Not Optimal
A promising start, but it leads to bad choices!
[Figure: weighted grid from source to sink; the values 18 and 22 are marked.]

MTP: Simple Recursive Program
MT(n, m)
    if n = 0 and m = 0
        return 0
    x ← -infinity, y ← -infinity
    if n > 0
        x ← MT(n-1, m) + length of the edge from (n-1, m) to (n, m)
    if m > 0
        y ← MT(n, m-1) + length of the edge from (n, m-1) to (n, m)
    return max{x, y}
What’s wrong with this approach?

Here’s what’s wrong
MT(n, m) needs MT(n, m-1) and MT(n-1, m). Both of these need MT(n-1, m-1), so MT(n-1, m-1) will be computed at least twice, and the repetition compounds at every level of the recursion. Dynamic programming uses the same idea as this recursive algorithm, but keeps all intermediate results in a table and reuses them.
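The repeated work can be made visible with a small sketch. The grid weights below are hypothetical (they are not the grid from the slides), and the names `mt_naive`, `mt_memo`, `DOWN`, and `RIGHT` are illustrative. The memoized version relies on Python's `functools.lru_cache` to play the role of the table.

```python
from functools import lru_cache

# Hypothetical weights for a grid with 3 x 3 vertices (not the grid from the slides).
# DOWN[i][j]  = length of the edge from (i, j) to (i+1, j)
# RIGHT[i][j] = length of the edge from (i, j) to (i, j+1)
DOWN = [[1, 0, 2],
        [4, 3, 1]]
RIGHT = [[3, 2],
         [0, 5],
         [1, 2]]

def mt_naive(n, m, counter):
    """Plain recursion; counter[0] counts how many calls are made."""
    counter[0] += 1
    if n == 0 and m == 0:
        return 0
    best = float("-inf")
    if n > 0:
        best = max(best, mt_naive(n - 1, m, counter) + DOWN[n - 1][m])
    if m > 0:
        best = max(best, mt_naive(n, m - 1, counter) + RIGHT[n][m - 1])
    return best

@lru_cache(maxsize=None)
def mt_memo(n, m):
    """Same recursion, but each (n, m) is computed once and then reused."""
    if n == 0 and m == 0:
        return 0
    best = float("-inf")
    if n > 0:
        best = max(best, mt_memo(n - 1, m) + DOWN[n - 1][m])
    if m > 0:
        best = max(best, mt_memo(n, m - 1) + RIGHT[n][m - 1])
    return best

calls = [0]
print(mt_naive(2, 2, calls), "score,", calls[0], "recursive calls")
print(mt_memo(2, 2), "score,", mt_memo.cache_info().misses, "distinct subproblems")
```

Both functions return the same score; the naive one simply makes more calls than there are vertices, and the gap grows quickly with grid size.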

MTP: Dynamic Programming
Calculate the optimal path score for each vertex in the graph. Each vertex’s score is the maximum, over its predecessor vertices, of the predecessor’s score plus the weight of the edge between them.
[Figure: grid with the source at the origin; the first step gives S_{1,0} = 5 and S_{0,1} = 1.]

MTP: Dynamic Programming (cont'd)
[Figure: S_{2,0} = 8, S_{1,1} = 4, S_{0,2} = (value not preserved).]

MTP: Dynamic Programming (cont'd)
[Figure: S_{3,0} = 8, S_{2,1} = 9, S_{1,2} = 13.]

MTP: Dynamic Programming (cont'd)
[Figure: S_{3,1} = 9, S_{2,2} = 12, S_{1,3} = 8. The greedy algorithm fails here!]

MTP: Dynamic Programming (cont'd)
[Figure: S_{3,2} = 9, S_{2,3} = 15.]

MTP: Dynamic Programming (cont'd)
[Figure: S_{3,3} = 16, showing all back-traces.] Done!

MTP: Recurrence
The score for a point (i, j) is computed by the recurrence relation:
$$
s_{i,j} = \max \begin{cases}
s_{i-1,j} + \text{weight of the edge between } (i-1,j) \text{ and } (i,j) \\
s_{i,j-1} + \text{weight of the edge between } (i,j-1) \text{ and } (i,j)
\end{cases}
$$
The running time is O(nm) for an n by m grid (n = # of rows, m = # of columns).
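A minimal sketch of this recurrence as a table fill, row by row. The function name `manhattan_tourist` and the two weight arrays are assumptions for illustration; the numbers are a made-up grid, not the one in the slides.

```python
def manhattan_tourist(down, right):
    """Fill the score table s for a weighted grid.

    down[i][j]  = weight of the edge from (i, j) to (i+1, j)
    right[i][j] = weight of the edge from (i, j) to (i, j+1)
    """
    n = len(down)       # number of downward steps (rows of edges)
    m = len(right[0])   # number of rightward steps (columns of edges)
    s = [[0] * (m + 1) for _ in range(n + 1)]

    # First column and first row have only one predecessor each.
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + down[i - 1][0]
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + right[0][j - 1]

    # Interior vertices: best of "arrive from above" and "arrive from the left".
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] + down[i - 1][j],
                          s[i][j - 1] + right[i][j - 1])
    return s

# Hypothetical 3 x 3-vertex grid.
DOWN = [[1, 0, 2],
        [4, 3, 1]]
RIGHT = [[3, 2],
         [0, 5],
         [1, 2]]

table = manhattan_tourist(DOWN, RIGHT)
for row in table:
    print(row)
```

The bottom-right entry of the table is the length of the longest source-to-sink path; every cell is touched once, which is where the O(nm) bound comes from.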

Manhattan Is Not A Perfect Grid
What about diagonals? The score at point B, which has incoming edges from A_1, A_2, and A_3, is given by:
$$
s_B = \max \begin{cases}
s_{A_1} + \text{weight of the edge } (A_1, B) \\
s_{A_2} + \text{weight of the edge } (A_2, B) \\
s_{A_3} + \text{weight of the edge } (A_3, B)
\end{cases}
$$

Manhattan Is Not A Perfect Grid (cont'd)
The score for point x is given by the recurrence relation:
$$
s_x = \max_{y \in \mathrm{Predecessors}(x)} \left\{ s_y + \text{weight of the edge } (y, x) \right\}
$$
where Predecessors(x) is the set of vertices that have edges leading to x. The running time for a graph G(V, E) (V is the set of all vertices and E is the set of all edges) is O(E), since each edge is evaluated once.

Traveling in the Grid
By the time vertex x is analyzed, the values s_y for all its predecessors y must already be computed; otherwise we are in trouble. We therefore need to traverse the vertices in a suitable order. For a grid, we can traverse vertices row by row, column by column, or diagonal by diagonal.

Traversing the Manhattan Grid
3 different strategies:
a) Column by column
b) Row by row
c) Along diagonals
[Figure: panels a), b), and c) illustrating the three traversal orders.]

Traversing a DAG
A numbering of the vertices of a graph is called a topological ordering of the DAG if every edge of the DAG connects a vertex with a smaller label to a vertex with a larger label. How can we obtain a topological ordering?
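One common answer, which the slides do not name, is Kahn's algorithm: repeatedly output a vertex with no remaining incoming edges. The sketch below uses it and then applies the predecessor recurrence in that order. The function names and the four-vertex example graph are hypothetical.

```python
from collections import defaultdict, deque

def topological_order(vertices, edges):
    """Kahn's algorithm. edges is a list of (u, v, weight) triples."""
    indegree = {v: 0 for v in vertices}
    successors = defaultdict(list)
    for u, v, _ in edges:
        successors[u].append(v)
        indegree[v] += 1

    queue = deque(v for v in vertices if indegree[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:   # all predecessors of v are already in order
                queue.append(v)
    if len(order) != len(vertices):
        raise ValueError("graph has a cycle, so it is not a DAG")
    return order

def longest_path_score(vertices, edges, source, sink):
    """s_x = max over predecessors y of (s_y + weight of edge (y, x))."""
    predecessors = defaultdict(list)
    for u, v, w in edges:
        predecessors[v].append((u, w))

    score = {v: float("-inf") for v in vertices}  # -inf marks "unreachable from source"
    score[source] = 0
    for x in topological_order(vertices, edges):
        for y, w in predecessors[x]:
            score[x] = max(score[x], score[y] + w)
    return score[sink]

# Hypothetical DAG with four vertices.
V = ["A", "B", "C", "D"]
E = [("A", "B", 2), ("A", "C", 5), ("B", "D", 4), ("C", "D", 0), ("B", "C", 4)]
print(topological_order(V, E))
print(longest_path_score(V, E, "A", "D"))
```

Each edge is examined once while building the order and once in the scoring loop, so the work stays proportional to the size of the graph.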

Alignment

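Aligning DNA Sequences
V = ATCTGATG (n = 8)
W = TGCATAC (m = 7)
Alignment: a 2 x k matrix (k ≥ m, n), with V in one row and W in the other. Each column is a match, a mismatch, an insertion, or a deletion; insertions and deletions together are called indels.
[Figure: an alignment of V and W with its matches, mismatches, insertions, and deletions labeled.]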

Longest Common Subsequence (LCS) – Alignment without Mismatches
Given two sequences v = v_1 v_2 … v_m and w = w_1 w_2 … w_n, the LCS of v and w is a sequence of positions in v:
1 ≤ i_1 < i_2 < … < i_t ≤ m
and a sequence of positions in w:
1 ≤ j_1 < j_2 < … < j_t ≤ n
such that the i_k-th letter of v equals the j_k-th letter of w for every k = 1, …, t, and t is maximal.

LCS: Example
i coords:       0 1 2 2 3 3 4 5 6 7 8
elements of v:    A T - C - T G A T C
elements of w:    - T G C A T - A - C
j coords:       0 0 1 2 3 4 5 5 6 6 7
Matched positions in v: 2 < 3 < 4 < 6 < 8
Matched positions in w: 1 < 3 < 5 < 6 < 7
Every common subsequence is a path in the 2-D grid:
(0,0) → (1,0) → (2,1) → (2,2) → (3,3) → (3,4) → (4,5) → (5,5) → (6,6) → (7,6) → (8,7)

Computing LCS
Let v_i = prefix of v of length i: v_1 … v_i, and w_j = prefix of w of length j: w_1 … w_j. The length of LCS(v_i, w_j) is computed by:
$$
s_{i,j} = \max \begin{cases}
s_{i-1,j} \\
s_{i,j-1} \\
s_{i-1,j-1} + 1 & \text{if } v_i = w_j
\end{cases}
$$
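A short sketch of the LCS length computation, assuming the usual form of the recurrence in which the diagonal term is s_{i-1,j-1} + 1 when the two letters match. The function name `lcs_length` is illustrative; the test strings are the ones used in the edit graph slides.

```python
def lcs_length(v, w):
    """Length of the longest common subsequence of strings v and w."""
    m, n = len(v), len(w)
    # s[i][j] = LCS length of the prefixes v[:i] and w[:j]; row 0 and column 0 stay 0.
    s = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s[i][j] = max(s[i - 1][j], s[i][j - 1])
            if v[i - 1] == w[j - 1]:          # strings are 0-indexed, the table is 1-indexed
                s[i][j] = max(s[i][j], s[i - 1][j - 1] + 1)
    return s[m][n]

print(lcs_length("ATCTGATC", "TGCATAC"))  # 5
```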

LCS Problem as Manhattan Tourist Problem
[Figure: grid with the letters of ATCTGATC along the i axis and the letters of TGCATAC along the j axis.]

Edit Graph for LCS Problem
[Figure: the same grid drawn as an edit graph, with diagonal edges.]

Every path is a common subsequence. Every diagonal edge adds an extra element to the common subsequence. LCS Problem: find a path with the maximum number of diagonal edges.

Backtracking
s_{i,j} allows us to compute the length of the LCS for v_i and w_j, and s_{m,n} gives us the length of the LCS for v and w. How do we print the actual LCS? At each step we made an optimal decision, s_{i,j} = max(…). Record which of the choices was made in order to obtain this max.

Printing LCS: Backtracking
PrintLCS(b, v, i, j)
    if i = 0 or j = 0
        return
    if b_{i,j} = "↖"
        PrintLCS(b, v, i-1, j-1)
        print v_i
    else
        if b_{i,j} = "↑"
            PrintLCS(b, v, i-1, j)
        else
            PrintLCS(b, v, i, j-1)
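The backtracking idea can be sketched as follows. The pointer labels "diag", "up", and "left" are stand-ins chosen here for the arrows stored in b, and the walk back is written as a loop rather than a recursion; both are assumptions of this sketch, not the slide's exact procedure.

```python
def lcs(v, w):
    """Return one longest common subsequence of v and w."""
    m, n = len(v), len(w)
    s = [[0] * (n + 1) for _ in range(m + 1)]
    b = [[None] * (n + 1) for _ in range(m + 1)]   # backtracking pointers

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Record which choice produced the maximum.
            s[i][j], b[i][j] = s[i - 1][j], "up"
            if s[i][j - 1] > s[i][j]:
                s[i][j], b[i][j] = s[i][j - 1], "left"
            if v[i - 1] == w[j - 1] and s[i - 1][j - 1] + 1 > s[i][j]:
                s[i][j], b[i][j] = s[i - 1][j - 1] + 1, "diag"

    # Follow the pointers from (m, n) back to the border of the table.
    letters = []
    i, j = m, n
    while i > 0 and j > 0:
        if b[i][j] == "diag":
            letters.append(v[i - 1])   # a diagonal edge contributes one matched letter
            i, j = i - 1, j - 1
        elif b[i][j] == "up":
            i -= 1
        else:
            j -= 1
    return "".join(reversed(letters))

print(lcs("ATCTGATC", "TGCATAC"))
```

When several longest common subsequences exist, which one is returned depends on how ties are broken when the pointers are recorded.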

From LCS to Alignment
The Longest Common Subsequence problem, the simplest form of sequence alignment, allows only insertions and deletions (no mismatches). In the LCS Problem, we scored 1 for matches and 0 for indels. Consider penalizing indels and mismatches with negative scores. Simplest scoring scheme:
+1 : match premium
-μ : mismatch penalty
-σ : indel penalty

Simple Scoring
When mismatches are penalized by -μ, indels are penalized by -σ, and matches are rewarded with +1, the resulting score is:
#matches - μ(#mismatches) - σ(#indels)
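As a small illustration of this formula, the sketch below scores an alignment that is already given as two rows of equal length with "-" for gaps. The function name `score_alignment` and the choice of "-" as the gap symbol are assumptions.

```python
def score_alignment(row_v, row_w, mu, sigma):
    """Score = #matches - mu * #mismatches - sigma * #indels."""
    if len(row_v) != len(row_w):
        raise ValueError("alignment rows must have equal length")
    matches = mismatches = indels = 0
    for a, b in zip(row_v, row_w):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            indels += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches - mu * mismatches - sigma * indels

# The alignment from the LCS example: 5 matches, 0 mismatches, 5 indels.
print(score_alignment("AT-C-TGATC", "-TGCAT-A-C", mu=1, sigma=1))  # 0
```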

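The Global Alignment Problem
Find the best alignment between two strings under a given scoring scheme.
Input: Strings v and w and a scoring scheme
Output: Alignment of maximum score
In the edit graph, horizontal and vertical edges have weight -σ, and diagonal edges have weight +1 for a match and -μ for a mismatch:
$$
s_{i,j} = \max \begin{cases}
s_{i-1,j-1} + 1 & \text{if } v_i = w_j \\
s_{i-1,j-1} - \mu & \text{if } v_i \neq w_j \\
s_{i-1,j} - \sigma \\
s_{i,j-1} - \sigma
\end{cases}
$$
μ : mismatch penalty
σ : indel penalty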
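A minimal sketch of global alignment under the simple scoring scheme, with a traceback that rebuilds one optimal alignment. The function name `global_align`, the default penalties, and the "-" gap symbol are assumptions; the border initialization (a run of indels costs σ per position) follows from the recurrence.

```python
def global_align(v, w, mu=1, sigma=1):
    """Return (score, aligned_v, aligned_w) for one optimal global alignment."""
    m, n = len(v), len(w)
    s = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # first column: v[:i] against nothing
        s[i][0] = s[i - 1][0] - sigma
    for j in range(1, n + 1):          # first row: nothing against w[:j]
        s[0][j] = s[0][j - 1] - sigma

    def diag(i, j):
        """Weight of the diagonal edge into (i, j)."""
        return 1 if v[i - 1] == w[j - 1] else -mu

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s[i][j] = max(s[i - 1][j - 1] + diag(i, j),
                          s[i - 1][j] - sigma,
                          s[i][j - 1] - sigma)

    # Traceback: at each vertex, step back along an edge that explains s[i][j].
    row_v, row_w = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + diag(i, j):
            row_v.append(v[i - 1]); row_w.append(w[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and s[i][j] == s[i - 1][j] - sigma:
            row_v.append(v[i - 1]); row_w.append("-")
            i -= 1
        else:
            row_v.append("-"); row_w.append(w[j - 1])
            j -= 1
    return s[m][n], "".join(reversed(row_v)), "".join(reversed(row_w))

score, top, bottom = global_align("ATCTGATC", "TGCATAC", mu=1, sigma=1)
print(score)
print(top)
print(bottom)
```

Removing the gap symbols from the two returned rows gives back the input strings, which is a quick sanity check on any traceback.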

Scoring Matrices
To generalize scoring, consider a (4+1) x (4+1) scoring matrix δ. For amino acid sequence alignment, the scoring matrix would be of size (20+1) x (20+1). The addition of 1 is to include the score for comparison with the gap character “-”. This simplifies the recurrence as follows:
$$
s_{i,j} = \max \begin{cases}
s_{i-1,j-1} + \delta(v_i, w_j) \\
s_{i-1,j} + \delta(v_i, -) \\
s_{i,j-1} + \delta(-, w_j)
\end{cases}
$$
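A sketch of the generalized recurrence, with δ passed in as a function of two symbols (either of which may be the gap character). The helper `make_simple_delta`, which rebuilds the +1 / -μ / -σ scheme as a δ, and the name `global_score` are illustrative assumptions; a lookup in a real scoring matrix could be wrapped in the same way.

```python
def make_simple_delta(mu, sigma):
    """Build a delta that reproduces the simple scheme: +1, -mu, -sigma."""
    def delta(a, b):
        if a == "-" or b == "-":
            return -sigma
        return 1 if a == b else -mu
    return delta

def global_score(v, w, delta):
    """Optimal global alignment score under an arbitrary scoring function delta."""
    m, n = len(v), len(w)
    s = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        s[i][0] = s[i - 1][0] + delta(v[i - 1], "-")
    for j in range(1, n + 1):
        s[0][j] = s[0][j - 1] + delta("-", w[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s[i][j] = max(s[i - 1][j - 1] + delta(v[i - 1], w[j - 1]),
                          s[i - 1][j] + delta(v[i - 1], "-"),
                          s[i][j - 1] + delta("-", w[j - 1]))
    return s[m][n]

print(global_score("ACGT", "ACGT", make_simple_delta(mu=1, sigma=1)))  # 4
```

The recurrence no longer needs separate match and mismatch cases, since that distinction now lives entirely inside δ.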

Making a Scoring Matrix
Scoring matrices are created based on biological evidence. Alignments can be thought of as two sequences that differ due to mutations. Some of these mutations have little effect on the protein’s function; therefore some penalties, δ(v_i, w_j), will be less harsh than others.

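Scoring Matrix: Example
     A    R    N    K
A    5   -2
R    -    7         3
N    -    -    7    0
K    -    -    -    6
(Only the upper triangle is shown; blank cells are entries that are not preserved here.)
Notice that although R and K are different amino acids, they have a positive score. Why? They are both positively charged amino acids, so substituting one for the other will not greatly change the function of the protein.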

Conservation
Amino acid changes that tend to preserve the physico-chemical properties of the original residue:
– Polar to polar: aspartate → glutamate
– Nonpolar to nonpolar: alanine → valine
– Similarly behaving residues: leucine to isoleucine

Local vs. Global Alignment
The Global Alignment Problem tries to find the longest path between vertices (0,0) and (n,m) in the edit graph. The Local Alignment Problem tries to find the longest path among paths between arbitrary vertices (i,j) and (i’, j’) in the edit graph. In an edit graph with negatively scored edges, a local alignment may score higher than the global alignment.

Local Alignment: Example
Compute a “mini” Global Alignment to get a Local Alignment.
[Figure: edit graph contrasting a global alignment path with a local alignment path.]

Local Alignments: Why?
Two genes in different species may be similar over short conserved regions and dissimilar over the remaining regions. Example:
– Homeobox genes have a short region called the homeodomain that is highly conserved between species.
– A global alignment would not find the homeodomain because it would try to align the ENTIRE sequence.

The Local Alignment Problem
Goal: Find the best local alignment between two strings.
Input: Strings v, w and scoring matrix δ
Output: Alignment of substrings of v and w whose alignment score is maximum among all possible alignments of all possible substrings

The Problem Is …
Long run time, O(n^4):
– In a grid of size n x n there are ~n^2 vertices (i,j) that may serve as a source.
– For each such vertex, computing alignments from (i,j) to (i’,j’) takes O(n^2) time.
This can be remedied by allowing every point to be the starting point, so that a single O(n^2) pass over the grid is enough.

The Local Alignment Recurrence
The largest value of s_{i,j} over the whole edit graph is the score of the best local alignment. The recurrence:
$$
s_{i,j} = \max \begin{cases}
0 \\
s_{i-1,j-1} + \delta(v_i, w_j) \\
s_{i-1,j} + \delta(v_i, -) \\
s_{i,j-1} + \delta(-, w_j)
\end{cases}
$$
The only change from the original recurrence of Global Alignment is the added 0 term.
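A sketch of the local alignment recurrence, returning the best score together with the two substrings it aligns. The names `local_align` and `simple_delta`, and the +1 / -1 / -1 scores inside `simple_delta`, are assumptions made for the example; any δ with the same call shape could be substituted.

```python
def simple_delta(a, b):
    """Hypothetical scoring: +1 match, -1 mismatch, -1 indel."""
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1

def local_align(v, w, delta):
    """Return (best score, substring of v, substring of w) for a best local alignment."""
    m, n = len(v), len(w)
    s = [[0] * (n + 1) for _ in range(m + 1)]   # borders stay 0: any vertex may start a path
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s[i][j] = max(0,
                          s[i - 1][j - 1] + delta(v[i - 1], w[j - 1]),
                          s[i - 1][j] + delta(v[i - 1], "-"),
                          s[i][j - 1] + delta("-", w[j - 1]))
            if s[i][j] > best:
                best, best_pos = s[i][j], (i, j)

    # Walk back from the best-scoring vertex until the score drops to 0.
    end_i, end_j = best_pos
    i, j = best_pos
    while i > 0 and j > 0 and s[i][j] > 0:
        if s[i][j] == s[i - 1][j - 1] + delta(v[i - 1], w[j - 1]):
            i, j = i - 1, j - 1
        elif s[i][j] == s[i - 1][j] + delta(v[i - 1], "-"):
            i -= 1
        else:
            j -= 1
    return best, v[i:end_i], w[j:end_j]

print(local_align("CCCCACGT", "ACGTGGGG", simple_delta))  # (4, 'ACGT', 'ACGT')
```

The table fill costs the same as a single global alignment; the 0 term is what lets a path start afresh at any vertex instead of carrying a negative prefix.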