CS 5263 Bioinformatics Lecture 3: Dynamic Programming and Sequence Alignment.



Roadmap Review of last lecture –Biology –Dynamic programming Sequence alignment

Polymer    Monomer
DNA        Deoxyribonucleotides
RNA        Ribonucleotides
Protein    Amino acids

Protein zoom-in Composed of a chain of amino acids, running from the N-terminal (amino group, H2N) to the C-terminal (carboxyl group, COOH). Each amino acid has the form H2N-CH(R)-COOH, where R is the side chain.

Genome, Chromosome, Gene

DNA Replication The process of copying a double-stranded DNA molecule –Semi-conservative
5’-ACATGATAA-3’ 3’-TGTACTAT-5’ -> 5’-ACATGATAA-3’ 3’-TGTACTATT-5’

Transcription
Coding strand (where genetic information is stored): 5’-ACGTAGACGTATAGAGCCTAG-3’
Template strand (for making mRNA): 3’-TGCATCTGCATATCTCGGATC-5’
mRNA: 5’-ACGUAGACGUAUAGAGCCUAG-3’
Coding strand and mRNA have the same sequence, except that T’s in DNA are replaced by U’s in mRNA.
DNA-RNA pair: A=U, C=G, T=A, G=C
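The strand relationships on this slide can be checked with a few lines of Python. This is only an illustrative sketch: the function names `template_strand` and `transcribe` are made up here, and the input is assumed to be an uppercase coding strand written 5'->3'.

```python
# Standard Watson-Crick base pairing for DNA.
DNA_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def template_strand(coding_5to3: str) -> str:
    """Position-by-position complement of the coding strand (read 3'->5')."""
    return "".join(DNA_COMPLEMENT[base] for base in coding_5to3)

def transcribe(coding_5to3: str) -> str:
    """mRNA equals the coding strand with every T replaced by U."""
    return coding_5to3.replace("T", "U")

coding = "ACGTAGACGTATAGAGCCTAG"
print(template_strand(coding))  # TGCATCTGCATATCTCGGATC
print(transcribe(coding))       # ACGUAGACGUAUAGAGCCUAG
```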

The Genetic Code (figure: codon table; surviving axis label: third letter)

Translation The sequence of codons is translated to a sequence of amino acids
Gene: -GCT TGT TTA CGA ATT-
mRNA: -GCU UGU UUA CGA AUU-
Peptide: -Ala - Cys - Leu - Arg - Ile-
–Start codon: AUG (also codes for Met)
–Stop codons: UGA, UAA, UAG
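A small sketch of codon-by-codon translation follows. The dictionary `CODON_TABLE` is deliberately partial: it holds only the codons used on this slide plus the start and stop codons, so it is not the full genetic code, and the function name `translate` is chosen here for illustration.

```python
# Partial codon table: only the codons needed for the slide's example,
# plus the start codon (AUG) and the three stop codons.
CODON_TABLE = {
    "GCU": "Ala", "UGU": "Cys", "UUA": "Leu", "CGA": "Arg", "AUU": "Ile",
    "AUG": "Met",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}

def translate(mrna: str) -> list:
    """Read the mRNA three bases at a time and stop at the first stop codon."""
    peptide = []
    usable = len(mrna) - len(mrna) % 3  # ignore a trailing partial codon
    for k in range(0, usable, 3):
        amino_acid = CODON_TABLE[mrna[k:k + 3]]
        if amino_acid == "Stop":
            break
        peptide.append(amino_acid)
    return peptide

print(translate("GCUUGUUUACGAAUU"))  # ['Ala', 'Cys', 'Leu', 'Arg', 'Ile']
```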

Dynamic programming What is dynamic programming? –Solve an optimization problem by tabulating sub-problem solutions (memoization) rather than re-computing them

Elements of dynamic programming Optimal sub-structures –Optimal solutions to the original problem contain optimal solutions to sub-problems –Solutions to sub-problems are independent Overlapping sub-problems –Some sub-problems appear in many solutions –We should not solve any sub-problem more than once Memoization and reuse –Carefully choose the order in which sub-problems are solved –Tabulate the solutions –Bottom-up

Example Find the shortest path in a grid from s = (0,0) to g = (3,3)

Optimal substructure If a path P(s, g) is optimal, any sub-path P(s,x), where x is on P(s,g), is also optimal Proof by contradiction –Suppose P(s,x) is not the shortest path from s to x, i.e., there is a path P’(s,x) < P(s,x) –Construct a new path P’(s,g) = P’(s,x) + P(x, g) –Then P’(s,g) < P(s,g), so P(s,g) is not the shortest –Contradiction

Overlapping sub-problems Some sub-problems are used by many paths (0,0) -> (2,0) used by 3 paths

Memoization and reuse Easy to tabulate and reuse –Number of sub-problems ~ number of nodes –P(s, x), for x in all nodes except s and g Find an order such that no sub-problems need to be recomputed –First compute the smallest sub-problems –Use solutions of small sub-problems to solve large sub-problems
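The tabulation idea can be sketched as follows. Several things here are assumptions, since the slide's figure is not in the transcript: moves are restricted to right and down steps, and edge weights are supplied in two hypothetical arrays, `right[i][j]` for the edge (i,j) -> (i,j+1) and `down[i][j]` for the edge (i,j) -> (i+1,j). The example weights are invented.

```python
def grid_shortest_path(right, down):
    """Return dist, where dist[i][j] is the shortest s-to-(i,j) path length."""
    rows = len(down) + 1       # number of node rows
    cols = len(right[0]) + 1   # number of node columns
    INF = float("inf")
    dist = [[INF] * cols for _ in range(rows)]
    dist[0][0] = 0
    # Row-by-row order guarantees the cells above and to the left are done.
    for i in range(rows):
        for j in range(cols):
            if i > 0:
                dist[i][j] = min(dist[i][j], dist[i - 1][j] + down[i - 1][j])
            if j > 0:
                dist[i][j] = min(dist[i][j], dist[i][j - 1] + right[i][j - 1])
    return dist

# Invented weights for a grid with 3 x 3 nodes.
right = [[1, 3], [2, 1], [4, 1]]
down = [[2, 1, 5], [1, 2, 1]]
print(grid_shortest_path(right, down)[-1][-1])  # 4
```

Each node is visited once and looks at no more than two predecessors, which is where the O(n^2) bound for DP comes from.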

Example: shortest path

Analysis For an n x n grid Enumeration: –number of paths = (2n)! / (n!)^2 –Each path has 2n steps –Total operations: 2n * (2n)! / (n!)^2 = O(n * 2^(2n)) Recursive call: O(2^(2n)) DP: O(n^2)

        Enumeration   Recursion   DP
N=5     2,520         1,032       60
N=10    3,695,120     1,048,576   420
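The enumeration figures can be reproduced from the formula 2n * (2n)! / (n!)^2. The sketch below uses `math.comb` from the Python standard library; the helper names are invented here, and 2^(2n) is printed only for comparison.

```python
from math import comb

def monotone_paths(n: int) -> int:
    """Number of right/down paths across an n x n grid: (2n)! / (n!)^2."""
    return comb(2 * n, n)

def enumeration_cost(n: int) -> int:
    """Each of the paths has 2n steps."""
    return 2 * n * monotone_paths(n)

for n in (5, 10):
    print(n, monotone_paths(n), enumeration_cost(n), 2 ** (2 * n))
# 5 252 2520 1024
# 10 184756 3695120 1048576
```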

Example: Fibonacci Seq F(n) = F(n-1) + F(n-2), F(0) = F(1) = 1
function fib(n)
  if (n == 0 or n == 1) return 1;
  else return fib(n-1) + fib(n-2);

Time complexity: O(1.62^n)

Example: Fibonacci Seq
function fib(n)
  F[0] = 1; F[1] = 1;
  for i = 2 to n
    F[i] = F[i-1] + F[i-2];
  end
  return F[n];

Time: O(n), space: O(n)
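For comparison, here are both versions in Python, using the slide's convention F(0) = F(1) = 1. The function names are chosen here for illustration. The last variant is an addition not on the slide: since each step needs only the previous two values, the table can be dropped to get O(1) space.

```python
def fib_naive(n: int) -> int:
    """Direct recursion: exponential time, recomputes sub-problems."""
    if n in (0, 1):
        return 1
    return fib_naive(n - 1) + fib_naive(n - 2)

def fib_table(n: int) -> int:
    """Bottom-up table: O(n) time, O(n) space."""
    F = [1] * (n + 1)
    for i in range(2, n + 1):
        F[i] = F[i - 1] + F[i - 2]
    return F[n]

def fib_two_vars(n: int) -> int:
    """Same recurrence keeping only the last two values: O(1) space."""
    prev, curr = 1, 1
    for _ in range(2, n + 1):
        prev, curr = curr, prev + curr
    return curr

print(fib_naive(10), fib_table(10), fib_two_vars(10))  # 89 89 89
```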

What if it is not so easy to figure out an order to fill in the table? Exercise
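One common approach to this question (offered here as a sketch, not necessarily the answer the exercise has in mind) is top-down memoization: keep the recursive formulation and cache each result, so the recursion itself discovers a valid order. The example uses `functools.lru_cache` from the Python standard library on the Fibonacci recurrence with F(0) = F(1) = 1.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib_memo(n: int) -> int:
    """Recursive definition; each distinct n is computed only once."""
    if n in (0, 1):
        return 1
    return fib_memo(n - 1) + fib_memo(n - 2)

print(fib_memo(30))  # 1346269
```

The trade-off is recursion depth and cache overhead in exchange for not having to work out the fill order by hand.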

Today’s lecture Sequence alignment –Global alignment

Why seq alignment? Similar sequences often have similar origin or function –Two genes are said to be homologous if they share a common evolutionary history. –Evolutionary history can tell us a lot about properties of a given gene –Homology can be inferred from similarity between the genes New protein sequences are routinely compared to sequence databases to search for proteins with the same or similar functions Sequence alignment tools are among the most widely used computational tools in biology

Evolution at the DNA level
…ACGGTGCAGTCACCA…
…ACGTTGC-GTCCACCA…
Sequence edits: Mutation, deletion, insertion

Evolutionary Rates (figure: changes passed to the next generation; some are OK, some are not (X), and some are “Still OK?”)

Sequence conservation implies function

Sequence Alignment
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Definition: An alignment of two strings S, T is a pair of strings S’, T’ (with spaces) s.t. (1) |S’| = |T’| (where |S| = “length of S”), and (2) removing all spaces in S’, T’ leaves S, T
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

What is a good alignment? Alignment: The “best” way to match the letters of one sequence with those of the other How do we define “best”?

The score of aligning (characters or spaces) x & y is σ(x,y). Score of an alignment: the sum of σ(S’[i], T’[i]) over all positions i = 1, …, |S’|. An optimal alignment: one with max score
S’: -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
T’: TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
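Scoring a given alignment is a single pass over its columns. The sketch below assumes the simplest σ: +m for a match, -s for a mismatch, and -d for any column containing a gap character. The function name and the default parameter values are chosen here for illustration.

```python
def alignment_score(s_aligned, t_aligned, m=1, s=1, d=1, gap="-"):
    """Sum sigma over the columns of an alignment (two equal-length strings)."""
    if len(s_aligned) != len(t_aligned):
        raise ValueError("aligned strings must have equal length")
    total = 0
    for a, b in zip(s_aligned, t_aligned):
        if a == gap or b == gap:
            total -= d      # character aligned to a space
        elif a == b:
            total += m      # match
        else:
            total -= s      # mismatch
    return total

print(alignment_score("AGTA", "A-TA"))  # 2
```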

Scoring Function
Sequence edits: AGGCCTC
–Mutations: AGGACTC
–Insertions: AGGGCCTC
–Deletions: AGG-CTC
Scoring Function:
Match: +m
Mismatch: -s
Gap (indel): -d
Example: ~~~AAC~~~ aligned with ~~~A-A~~~

More complex scoring function Substitution matrix –Similarity score of matching two letters a, b should reflect the probability that a and b are derived from the same ancestor –It is usually defined by a log likelihood ratio (see the Durbin book) –Active research area, especially for proteins –Commonly used: PAM, BLOSUM

An example substitution matrix (rows and columns A, C, G, T; upper triangle as it survives in the transcript)
A: 3 -2 -2
C: 3
G: 3 -2
T: 3

Match = 2, mismatch = -1, gap = -1 Score = 3 x 2 – 2 x 1 – 1 x 1 = 3

How to find it? A naïve algorithm:
for all subseqs A of S, B of T s.t. |A| = |B| do
  align A[i] with B[i], 1 ≤ i ≤ |A|
  align all other chars to spaces
  compute its value
  retain the max
end
output the retained alignment
Example: S = abcd, A = cd, T = wxyz, B = xz
-abc-d    a-bc-d
w--xyz    -w-xyz

Analysis Assume |S| = |T| = n
Cost of evaluating one alignment: ≥ n
How many alignments are there: C(2n, n)
–pick n chars of S,T together
–say k of them are in S
–match these k to the k unpicked chars of T
Total time: ≥ n * C(2n, n) > 2^(2n), for n > 3
E.g., for n = 20, time is > 2^40 > 10^12 operations
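The counting argument can be sanity-checked numerically. The sketch assumes the naïve algorithm considers one alignment per pair of equal-length subsequences, so that choosing k characters from each string gives C(n,k)^2 choices; summed over k this equals C(2n, n). The function name is made up for illustration.

```python
from math import comb

def count_alignments(n: int) -> int:
    """Pairs (A, B) of equal-length subsequences of two length-n strings."""
    return sum(comb(n, k) ** 2 for k in range(n + 1))

n = 20
print(count_alignments(n) == comb(2 * n, n))        # True
print(n * count_alignments(n) > 2 ** 40 > 10 ** 12)  # True
```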

Dynamic Programming We will now describe a dynamic programming algorithm Suppose we wish to align x_1…x_M with y_1…y_N Let F(i,j) = optimal score of aligning x_1…x_i with y_1…y_j

Dynamic Programming (cont’d) Notice three possible cases:
1. x_M aligns to y_N: F(M,N) = F(M-1, N-1) + m if x_M = y_N, or F(M-1, N-1) - s if not
2. x_M aligns to a gap: F(M,N) = F(M-1, N) - d
3. y_N aligns to a gap: F(M,N) = F(M, N-1) - d

Therefore: F(M,N) = max{ F(M-1, N-1) + σ(x_M, y_N), F(M-1, N) - d, F(M, N-1) - d }, where σ(x_M, y_N) = m if x_M = y_N, and -s otherwise. Each sub-problem can be solved recursively

Generalize: F(i,j) = max{ F(i-1, j-1) + σ(x_i, y_j), F(i-1, j) - d, F(i, j-1) - d } Be careful with the boundary conditions

Remember: –The recursive formula is for understanding the relationship between sub-problems –We cannot afford to really solve them recursively Number of sub-problems: –Each corresponds to calculating an F(i, j) –O(MN) of them –Solve all of them

What order to fill? (figure: the table, with F(0,0) at one corner and F(M,N) at the opposite corner)

F(i, j) = max{ F(i-1, j-1) + σ(x_i, y_j) [case 1], F(i-1, j) - d [case 2], F(i, j-1) - d [case 3] } Each F(i, j) depends only on three neighboring cells: F(i-1, j-1) (1), F(i-1, j) (2), and F(i, j-1) (3)

Example x = AGTA, y = ATA, m = 1, s = 1, d = 1 (match +1, mismatch -1, gap -1)
F(i,j), with i indexing x (columns) and j indexing y (rows):

           i = 0    1    2    3    4
                    A    G    T    A
j = 0          0   -1   -2   -3   -4
j = 1   A     -1    1    0   -1   -2
j = 2   T     -2    0    0    1    0
j = 3   A     -3   -1   -1    0    2

The first row and column are the boundary conditions F(i,0) = -i * d and F(0,j) = -j * d; the remaining cells are filled in from the recurrence.
Optimal Alignment: F(4,3) = 2
This only tells us the best score
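The table for this example can be regenerated with a short bottom-up fill. This is a sketch under the scoring used in the example (match +1, mismatch -1, gap -1, passed as the positive penalties m, s, d); the function name `nw_table` is invented, and `F[i][j]` scores the first i characters of x against the first j characters of y.

```python
def nw_table(x, y, m=1, s=1, d=1):
    """Fill the global-alignment score table F for strings x and y."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):          # boundary: x prefix against gaps
        F[i][0] = -i * d
    for j in range(1, N + 1):          # boundary: y prefix against gaps
        F[0][j] = -j * d
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sigma,  # x_i aligned to y_j
                          F[i - 1][j] - d,          # x_i aligned to a gap
                          F[i][j - 1] - d)          # y_j aligned to a gap
    return F

F = nw_table("AGTA", "ATA")
for j in range(4):                     # print one row per character of y
    print([F[i][j] for i in range(5)])
# [0, -1, -2, -3, -4]
# [-1, 1, 0, -1, -2]
# [-2, 0, 0, 1, 0]
# [-3, -1, -1, 0, 2]
```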

Trace-back x = AGTA, y = ATA, m = 1, s = 1, d = 1
F(i,j) = max{ F(i-1, j-1) + σ(x_i, y_j), F(i-1, j) - d, F(i, j-1) - d }
Starting from F(4,3), find which of the three cases produced each value and step to that cell:
–F(4,3) = 2 = F(3,2) + σ(A,A): x_4 = A aligns to y_3 = A
–F(3,2) = 1 = F(2,1) + σ(T,T): x_3 = T aligns to y_2 = T
–F(2,1) = 0 = F(1,1) - d: x_2 = G aligns to a gap
–F(1,1) = 1 = F(0,0) + σ(A,A): x_1 = A aligns to y_1 = A
Optimal Alignment: F(4,3) = 2
AGTA
A-TA
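A trace-back that stores no pointers can re-derive, at each cell, which case of the recurrence produced the value. The sketch below assumes match +1, mismatch -1, gap -1 by default and breaks ties in the order diagonal, then F(i-1, j), then F(i, j-1); that tie order is an arbitrary choice made here, and the function name is invented.

```python
def nw_traceback(x, y, m=1, s=1, d=1):
    """Return (score, aligned_x, aligned_y) by re-checking the recurrence."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d
    for j in range(1, N + 1):
        F[0][j] = -j * d
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sigma, F[i - 1][j] - d, F[i][j - 1] - d)

    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sigma = m if x[i - 1] == y[j - 1] else -s
            if F[i][j] == F[i - 1][j - 1] + sigma:   # x_i aligned to y_j
                ax.append(x[i - 1]); ay.append(y[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] - d:     # x_i aligned to a gap
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:                                        # y_j aligned to a gap
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))

print(nw_traceback("AGTA", "ATA"))  # (2, 'AGTA', 'A-TA')
```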

In some cases, trace-back may be very time consuming Alternative solution: remember where you come from! –Trade-off: more memory

Using trace-back pointers x = AGTA, y = ATA, m = 1, s = 1, d = 1
While filling in F(i,j), also store for each cell a pointer to the neighboring cell that gave the max. Following the pointers from F(4,3) back to F(0,0): (4,3) -> (3,2) -> (2,1) -> (1,1) -> (0,0)
Optimal Alignment: F(4,3) = 2
AGTA
A-TA

The Needleman-Wunsch Algorithm
1. Initialization.
  a. F(0, 0) = 0
  b. F(0, j) = -j * d
  c. F(i, 0) = -i * d
2. Main Iteration. Filling in scores
  a. For each i = 1……M
       For each j = 1……N
         F(i, j) = max{ F(i-1, j) - d [case 1], F(i, j-1) - d [case 2], F(i-1, j-1) + σ(x_i, y_j) [case 3] }
         Ptr(i, j) = UP if [case 1], LEFT if [case 2], DIAG if [case 3]
3. Termination. F(M, N) is the optimal score, and the optimal alignment can be traced back from Ptr(M, N)
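The three steps translate fairly directly into Python. Treat this as a sketch rather than a reference implementation: σ is assumed to be +m for a match and -s for a mismatch, the pointers on the boundary row and column (UP and LEFT) are an addition the slide does not spell out, and ties are resolved in the order case 1, case 2, case 3.

```python
def needleman_wunsch(x, y, m=1, s=1, d=1):
    """Global alignment with stored trace-back pointers."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]

    # 1. Initialization (boundary pointers are an assumption of this sketch).
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "UP"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "LEFT"

    # 2. Main iteration: fill in scores and pointers.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            candidates = [(F[i - 1][j] - d, "UP"),             # case 1
                          (F[i][j - 1] - d, "LEFT"),           # case 2
                          (F[i - 1][j - 1] + sigma, "DIAG")]   # case 3
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])

    # 3. Termination: follow pointers from (M, N) back to (0, 0).
    ax, ay = [], []
    i, j = M, N
    while ptr[i][j] is not None:
        move = ptr[i][j]
        if move == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "UP":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))

print(needleman_wunsch("AGTA", "ATA"))  # (2, 'AGTA', 'A-TA')
```

Storing `ptr` doubles the memory but makes the trace-back a simple walk of at most M + N steps.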

Performance Time: O(NM) Space: O(NM) Later we will cover more efficient methods

A variant of the basic algorithm: Maybe it is OK to have an unlimited # of gaps in the beginning and end:
          CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC
GCGAGTTCATCTATCAC--GACCGC--GGTCG
Then, we don’t want to penalize gaps in the ends

The Overlap Detection variant Changes:
1. Initialization: for all i, j, F(i, 0) = 0 and F(0, j) = 0
2. Termination: F_OPT = max{ max_i F(i, N), max_j F(M, j) }
(figure: table with x_1 … x_M along one axis and y_N … y_1 along the other)
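Only the initialization and the termination differ from the global algorithm, as this sketch shows. The scoring defaults (match +1, mismatch -1, gap -1), the function name, and the test strings are invented for illustration.

```python
def overlap_score(x, y, m=1, s=1, d=1):
    """Best score when gaps at the beginning and end are not penalized."""
    M, N = len(x), len(y)
    # Initialization: first row and column stay 0 (free leading gaps).
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sigma, F[i - 1][j] - d, F[i][j - 1] - d)
    # Termination: best value where x or y has been fully consumed.
    best_y_consumed = max(F[i][N] for i in range(M + 1))
    best_x_consumed = max(F[M][j] for j in range(N + 1))
    return max(best_y_consumed, best_x_consumed)

# The suffix ACGT of x overlaps the prefix ACGT of y.
print(overlap_score("GGGACGT", "ACGTTTT"))  # 4
```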

Different types of overlaps (figure: possible overlap configurations of x and y)

A non-bio variant Shell command “diff” in unix –Given file1 and file2 –Find the difference between file1 and file2 –Similar to sequence alignment –How to score? Longest common subsequence (LCS) Match has score 1 No mismatch penalty No gap penalty

File1 A B C D E F File2 G B C E F

File1 A B C D E F
File2 G B C - E F
$ diff file1 file2
1c1
< A
---
> G
4c4
< D
---
> -
LCS = 4

The LCS variant Changes:
1. Initialization: for all i, j, F(i, 0) = F(0, j) = 0
2. Filling in table: F(i, j) = max{ F(i-1, j), F(i, j-1), F(i-1, j-1) + σ(x_i, y_j) }, where σ(x_i, y_j) = 1 if x_i = y_j and 0 otherwise.
3. Termination: F_OPT = max{ max_i F(i, N), max_j F(M, j) }
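A sketch of the LCS variant, treating each file as a list of lines. The function name `lcs` is chosen here, and the trace-back that recovers the common lines is an addition beyond what the slide lists. Because nothing is penalized, F never decreases along a row or column, so F(M, N) already equals the termination maximum.

```python
def lcs(a, b):
    """Return (length, common_subsequence) for two sequences of lines."""
    M, N = len(a), len(b)
    F = [[0] * (N + 1) for _ in range(M + 1)]    # F(i,0) = F(0,j) = 0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = 1 if a[i - 1] == b[j - 1] else 0
            F[i][j] = max(F[i - 1][j], F[i][j - 1], F[i - 1][j - 1] + sigma)

    # Trace back to recover one longest common subsequence.
    common = []
    i, j = M, N
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1] and F[i][j] == F[i - 1][j - 1] + 1:
            common.append(a[i - 1])
            i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return F[M][N], common[::-1]

file1 = ["A", "B", "C", "D", "E", "F"]
file2 = ["G", "B", "C", "E", "F"]
print(lcs(file1, file2))  # (4, ['B', 'C', 'E', 'F'])
```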

What happens if you have 1 million lines of text in each file?
Slow
–What if the majority of the two files are the same? (e.g., two versions of a software)
–Bounded DP
Memory inefficient
–At least 1000 GB memory
–Linear-space algorithm, same time complexity
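The memory point can be illustrated for the score alone: each row of the table depends only on the previous row, so two rows are enough. This sketch uses global-alignment scoring with match +1, mismatch -1, gap -1 as defaults, and the function name is invented. It returns only the optimal score; recovering the alignment itself in linear space needs a divide-and-conquer scheme (Hirschberg's algorithm is the classic one), which is not shown.

```python
def nw_score_linear_space(x, y, m=1, s=1, d=1):
    """Optimal global alignment score using O(min(|x|, |y|)) memory."""
    if len(y) > len(x):
        x, y = y, x                 # scoring is symmetric; keep rows short
    N = len(y)
    prev = [-j * d for j in range(N + 1)]       # row for i = 0
    for i in range(1, len(x) + 1):
        curr = [-i * d] + [0] * N               # F(i, 0) = -i * d
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            curr[j] = max(prev[j - 1] + sigma,  # diagonal
                          prev[j] - d,          # from row i-1
                          curr[j - 1] - d)      # from column j-1
        prev = curr
    return prev[N]

print(nw_score_linear_space("AGTA", "ATA"))  # 2
```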

See you next week