Sequence Alignment Cont’d. Evolution Scoring Function Sequence edits: AGGCCTC  Mutations AGGACTC  Insertions AGGGCCTC  Deletions AGG.CTC Scoring Function:

Slides:



Advertisements
Similar presentations
Introduction to Bioinformatics Algorithms Divide & Conquer Algorithms.
Advertisements

Introduction to Bioinformatics Algorithms Divide & Conquer Algorithms.
CS 5263 Bioinformatics Lecture 3: Dynamic Programming and Global Sequence Alignment.
Sequence allignement 1 Chitta Baral. Sequences and Sequence allignment Two main kind of sequences –Sequence of base pairs in DNA molecules (A+T+C+G)*
Inexact Matching of Strings General Problem –Input Strings S and T –Questions How distant is S from T? How similar is S to T? Solution Technique –Dynamic.
Rapid Global Alignments How to align genomic sequences in (more or less) linear time.
Refining Edits and Alignments Υλικό βασισμένο στο κεφάλαιο 12 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University.
Sequence Alignment.
Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TAs: Eugene Davydov Christina Pop Monday & Wednesday.
Linear-Space Alignment. Subsequences and Substrings Definition A string x’ is a substring of a string x, if x = ux’v for some prefix string u and suffix.
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 12: Refining Core String.
6/11/2015 © Bud Mishra, 2001 L7-1 Lecture #7: Local Alignment Computational Biology Lecture #7: Local Alignment Bud Mishra Professor of Computer Science.
Genomic Sequence Alignment. Overview Dynamic programming & the Needleman-Wunsch algorithm Local alignment—BLAST Fast global alignment Multiple sequence.
Welcome to CS262!. Goals of this course Introduction to Computational Biology  Basic biology for computer scientists  Breadth: mention many topics &
Computational Genomics Lecture 1, Tuesday April 1, 2003.
Space Efficient Alignment Algorithms and Affine Gap Penalties
Space Efficient Alignment Algorithms Dr. Nancy Warter-Perez June 24, 2005.
Sequence Alignment Algorithms in Computational Biology Spring 2006 Edited by Itai Sharon Most slides have been created and edited by Nir Friedman, Dan.
CS 5263 Bioinformatics Lecture 5: Affine Gap Penalties.
Sequence Alignment. Scoring Function Sequence edits: AGGCCTC  MutationsAGGACTC  InsertionsAGGGCCTC  DeletionsAGG. CTC Scoring Function: Match: +m Mismatch:
Sequence Alignment. CS262 Lecture 2, Win06, Batzoglou Complete DNA Sequences More than 300 complete genomes have been sequenced.
Linear-Space Alignment. Linear-space alignment Using 2 columns of space, we can compute for k = 1…M, F(M/2, k), F r (M/2, N – k) PLUS the backpointers.
Inexact Matching General Problem –Input Strings S and T –Questions How distant is S from T? How similar is S to T? Solution Technique –Dynamic programming.
Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.
Sequence Alignment Cont’d. Sequence Alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition Given two strings.
Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation.
Sequence Alignment.
Sequence Alignment Bioinformatics. Sequence Comparison Problem: Given two sequences S & T, are S and T similar? Need to establish some notion of similarity.
Sequence Alignment Variations Computing alignments using only O(m) space rather than O(mn) space. Computing alignments with bounded difference Exclusion.
Sequence Alignment Lecture 2, Thursday April 3, 2003.
Sequence Alignment. Before we start, administrivia Instructor: Serafim Batzoglou, CS x Office hours: Monday 2:00-3:30 TA:
Alignments and Comparative Genomics. Welcome to CS374! Today: Serafim: Alignments and Comparative Genomics Omkar: Administrivia.
Welcome to CS262: Computational Genomics Instructor: Serafim Batzoglou TAs: Eugene Davydov Christina Pop Monday & Wednesday.
Sequence Alignment Slides courtesy of Serafim Batzoglou, Stanford Univ.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 20, 2003.
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 11: Core String Edits.
CS262 Lecture 4, Win07, Batzoglou Heuristic Local Alignerers 1.The basic indexing & extension technique 2.Indexing: techniques to improve sensitivity Pairs.
Introduction to Bioinformatics Algorithms Sequence Alignment.
Alignment II Dynamic Programming
Sequence similarity. Motivation Same gene, or similar gene Suffix of A similar to prefix of B? Suffix of A similar to prefix of B..Z? Longest similar.
Sequence Alignment Lecture 2, Thursday April 3, 2003.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 10, 2005.
Sequence Alignment. CS262 Lecture 3, Win06, Batzoglou Sequence Alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition.
Sequence Alignment. -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Given two strings x = x 1 x 2...x M, y = y 1 y 2 …y N,
Class 2: Basic Sequence Alignment
Space Efficient Alignment Algorithms Dr. Nancy Warter-Perez.
Dynamic Programming Introduction to Algorithms Dynamic Programming CSE 680 Prof. Roger Crawfis.
Developing Pairwise Sequence Alignment Algorithms
Sequence Alignment.
CS 5263 Bioinformatics Lecture 4: Global Sequence Alignment Algorithms.
Pair-wise Sequence Alignment What happened to the sequences of similar genes? random mutation deletion, insertion Seq. 1: 515 EVIRMQDNNPFSFQSDVYSYG EVI.
Sequence Alignment. 2 Sequence Comparison Much of bioinformatics involves sequences u DNA sequences u RNA sequences u Protein sequences We can think of.
BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment.
Pairwise alignments Introduction Introduction Why do alignments? Why do alignments? Definitions Definitions Scoring alignments Scoring alignments Alignment.
Space-Efficient Sequence Alignment Space-Efficient Sequence Alignment Bioinformatics 202 University of California, San Diego Lecture Notes No. 7 Dr. Pavel.
Comp. Genomics Recitation 2 12/3/09 Slides by Igor Ulitsky.
CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)
Pairwise Sequence Alignment (I) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 22, 2005 ChengXiang Zhai Department of Computer Science University.
Minimum Edit Distance Definition of Minimum Edit Distance.
Sequence Comparison Algorithms Ellen Walker Bioinformatics Hiram College.
Advanced Algorithms and Models for Computational Biology -- a machine learning approach Computational Genomics I: Sequence Alignment Eric Xing Lecture.
Dynamic Programming Presenters: Michal Karpinski Eric Hoffstetter.
Sequence Alignment Tanya Berger-Wolf CS502: Algorithms in Computational Biology January 25, 2011.
Space Efficient Alignment Algorithms and Affine Gap Penalties Dr. Nancy Warter-Perez.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Sequence Similarity.
9/27/10 A. Smith; based on slides by E. Demaine, C. Leiserson, S. Raskhodnikova, K. Wayne Adam Smith Algorithm Design and Analysis L ECTURE 16 Dynamic.
1 Sequence Alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition Given two strings x = x 1 x 2...x M, y = y.
Pairwise sequence Alignment.
Bioinformatics Algorithms and Data Structures
Presentation transcript:

Sequence Alignment Cont’d

Evolution

Scoring Function Sequence edits: AGGCCTC  Mutations AGGACTC  Insertions AGGGCCTC  Deletions AGG.CTC Scoring Function: Match: +m Mismatch: -s Gap:-d Score F = (# matches)  m - (# mismatches)  s – (#gaps)  d

Sequence Alignment -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition Given two strings x = x 1 x 2...x M, y = y 1 y 2 …y N, an alignment is an assignment of gaps to positions 0,…, N in x, and 0,…, N in y, so as to line up each letter in one sequence with either a letter, or a gap in the other sequence AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC

The Needleman-Wunsch Algorithm 1.Initialization. a.F(0, 0) = 0 b.F(0, j) = - j  d c.F(i, 0)= - i  d 2.Main Iteration. Filling-in partial alignments a.For each i = 1……M For eachj = 1……N F(i-1,j-1) + s(x i, y j ) [case 1] F(i, j) = max F(i-1, j) – d [case 2] F(i, j-1) – d [case 3] DIAG, if [case 1] Ptr(i,j)= LEFT,if [case 2] UP,if [case 3] 3.Termination. F(M, N) is the optimal score, and from Ptr(M, N) can trace back optimal alignment

The Smith-Waterman algorithm Idea: Ignore badly aligning regions Modifications to Needleman-Wunsch: Initialization:F(0, j) = F(i, 0) = 0 0 Iteration:F(i, j) = max F(i – 1, j) – d F(i, j – 1) – d F(i – 1, j – 1) + s(x i, y j )

Scoring the gaps more accurately Simple, linear gap model: Gap of length n incurs penaltyn  d However, gaps usually occur in bunches Convex gap penalty function:  (n): for all n,  (n + 1) -  (n)   (n) -  (n – 1) Algorithm: O(N 3 ) time, O(N 2 ) space  (n)

Compromise: affine gaps  (n) = d + (n – 1)  e | | gap gap open extend To compute optimal alignment, At position i,j, need to “remember” best score if gap is open best score if gap is not open F(i, j):score of alignment x 1 …x i to y 1 …y j if if x i aligns to y j if G(i, j):score if x i aligns to a gap after y j if H(i, j): score if y j aligns to a gap after x i V(i, j) = best score of alignment x 1 …x i to y 1 …y j d e  (n)

Needleman-Wunsch with affine gaps Why do we need two matrices? x i aligns to y j x 1 ……x i-1 x i x i+1 y 1 ……y j-1 y j - 2.x i aligns to a gap x 1 ……x i-1 x i x i+1 y 1 ……y j …- - Add -d Add -e G(i+1, j) = V(i, j) – d G(i+1, j) = G(i, j) – e

Needleman-Wunsch with affine gaps Initialization:V(i, 0) = d + (i – 1)  e V(0, j) = d + (j – 1)  e Iteration: V(i, j) = max{ F(i, j), G(i, j), H(i, j) } F(i, j) = V(i – 1, j – 1) + s(x i, y j ) V(i – 1, j) – d G(i, j) = max G(i – 1, j) – e V(i, j – 1) – d H(i, j) = max H(i, j – 1) – e Termination: similar Time? Space?

To generalize a little… … think of how you would compute optimal alignment with this gap function ….in time O(MN)  (n)

Bounded Dynamic Programming Assume we know that x and y are very similar Assumption: # gaps(x, y) M ) xixi Then,|implies | i – j | < k(N) yj yj We can align x and y more efficiently: Time, Space: O(N  k(N)) << O(N 2 )

Bounded Dynamic Programming Initialization: F(i,0), F(0,j) undefined for i, j > k Iteration: For i = 1…M For j = max(1, i – k)…min(N, i+k) F(i – 1, j – 1)+ s(x i, y j ) F(i, j) = maxF(i, j – 1) – d, if j > i – k(N) F(i – 1, j) – d, if j < i + k(N) Termination:same Easy to extend to the affine gap case x 1 ………………………… x M y 1 ………………………… y N k(N)

Linear-Space Alignment

Hirschberg’s algortihm Longest common subsequence  Given sequences s = s 1 s 2 … s m, t = t 1 t 2 … t n,  Find longest common subsequence u = u 1 … u k Algorithm: F(i-1, j) F(i, j) = maxF(i, j-1) F(i-1, j-1) + [1, if s i = t j ; 0 otherwise] Hirschberg’s algorithm solves this in linear space

F(i,j) Introduction: Compute optimal score It is easy to compute F(M, N) in linear space Allocate ( column[1] ) Allocate ( column[2] ) For i = 1….M If i > 1, then: Free( column[i – 2] ) Allocate( column[ i ] ) For j = 1…N F(i, j) = …

Linear-space alignment To compute both the optimal score and the optimal alignment: Divide & Conquer approach: Notation: x r, y r : reverse of x, y E.g.x = accgg; x r = ggcca F r (i, j): optimal score of aligning x r 1 …x r i & y r 1 …y r j same as aligning x M-i+1 …x M & y N-j+1 …y N

Linear-space alignment Lemma: F(M, N) = max k=0…N ( F(M/2, k) + F r (M/2, N-k) ) x y M/2 k*k* F(M/2, k) F r (M/2, N-k)

Linear-space alignment Now, using 2 columns of space, we can compute for k = 1…M, F(M/2, k), F r (M/2, N-k) PLUS the backpointers

Linear-space alignment Now, we can find k * maximizing F(M/2, k) + F r (M/2, N-k) Also, we can trace the path exiting column M/2 from k * k*k* k * …… M/2 M/2+1 …… M M+1

Linear-space alignment Iterate this procedure to the left and right! N-k * M/2 k*k*

Linear-space alignment Hirschberg’s Linear-space algorithm: MEMALIGN(l, l’, r, r’):(aligns x l …x l’ with y r …y r’ ) 1.Let h =  (l’-l)/2  2.Find (in Time O((l’ – l)  (r’-r)), Space O(r’-r)) the optimal path,L h, entering column h-1, exiting column h Let k 1 = pos’n at column h – 2 where L h enters k 2 = pos’n at column h + 1 where L h exits 3.MEMALIGN(l, h-2, r, k 1 ) 4.Output L h 5.MEMALIGN(h+1, l’, k 2, r’) Top level call: MEMALIGN(1, M, 1, N)

Linear-space alignment Time, Space analysis of Hirschberg’s algorithm: To compute optimal path at middle column, For box of size M  N, Space: 2N Time:cMN, for some constant c Then, left, right calls cost c( M/2  k * + M/2  (N-k * ) ) = cMN/2 All recursive calls cost Total Time: cMN + cMN/2 + cMN/4 + ….. = 2cMN = O(MN) Total Space: O(N) for computation, O(N+M) to store the optimal alignment

The Four-Russian Algorithm A useful speedup of Dynamic Programming [ Arlazarov, Dinic, Kronrod, Faradzev 1970]

Main Observation Within a rectangle of the DP matrix, values of D depend only on the values of A, B, C, and substrings x l...l’, y r…r’ Definition: A t-block is a t  t square of the DP matrix Idea: Divide matrix in t-blocks, Precompute t-blocks Speedup: O(t) A B C D xlxl x l’ yryr y r’ t

The Four-Russian Algorithm Main structure of the algorithm: Divide N  N DP matrix into K  K log 2 N-blocks that overlap by 1 column & 1 row For i = 1……K For j = 1……K Compute D i,j as a function of A i,j, B i,j, C i,j, x[l i …l’ i ], y[r j …r’ j ] Time: O(N 2 / log 2 N) times the cost of step 4 t t t

The Four-Russian Algorithm Another observation: ( Assume m = 0, s = 1, d = 1, find minimum # of edits ) Lemma. Two adjacent cells of F(.,.) differ by at most 1 Gusfield’s book covers case where m = 0, called the edit distance (p. 216): minimum # of substitutions + gaps to transform one string to another

The Four-Russian Algorithm Proof of Lemma: 1.Same row: a.F(i, j) – F(i – 1, j)  -1 At worst, one more gap:x 1 ……x i-1 x i y 1 ……y j – b.F(i – 1, j) – F(i, j)  -1 F(i, j) F(i – 1, j) F(i – 1, j) – F(i, j) x 1 ……x i-1 x i x 1 ……x i-1 – y 1 ……y a-1 y a y a+1 …y j y 1 ……y a-1 y a y a+1 …y j  -1 x 1 ……x i-1 x i x 1 ……x i-1 y 1 ……y a-1 – y a …y j y 1 ……y a-1 y a …y j +1 2.Same column: similar argument

The Four-Russian Algorithm Proof of Lemma: 3.Same diagonal: a.F(i, j) – F(i – 1, j – 1)  -1 At worst, one additional mismatch in F(i, j) b.F(i – 1, j – 1) – F(i, j)  -1 F(i, j)F(i – 1, j – 1) F(i – 1, j – 1) – F(i, j) x 1 ……x i-1 x i x 1 ……x i-1 | y 1 ……y i-1 y j y 1 ……y j-1  0 x 1 ……x i-1 x i x 1 ……x i-1 y 1 ……y a-1 – y a …y j y 1 ……y a-1 y a …y j +1

The Four-Russian Algorithm Definition: The offset vector is a t-long vector of values from {-1, 0, 1}, where the first entry is 0 If we know the value at A, and the top row, left column offset vectors, and x l ……x l’, y r ……y r’, Then we can find D A B C D xlxl x l’ yryr y r’ t

The Four-Russian Algorithm Example: x = AACT y = CACT AAC T C A C T

The Four-Russian Algorithm Example: x = AACT y = CACT AAC T C A C T

The Four-Russian Algorithm Definition: The offset function of a t- block is a function that for any given offset vectors of top row, left column, and x l ……x l’, y r ……y r’, produces offset vectors of bottom row, right column A B C D xlxl x l’ yryr y r’ t

The Four-Russian Algorithm We can pre-compute the offset function: 3 2(t-1) possible input offset vectors 4 2t possible strings x l ……x l’, y r ……y r’ Therefore 3 2(t-1)  4 2t values to pre-compute Or, t 2  3 2(t-1)  4 2t time for pre-computation We can keep all these values in a table, look up in O(1) time if we assume constant-lookup RAM for log-sized inputs Then, letting t = (log 3  4 N)/2, we get Pre-computation: O( N  log 2 N ) Running Time: N 2 /t 2 = O(N 2 /log 2 N)

The Four-Russian Algorithm Four-Russians Algorithm: (Arlazarov, Dinic, Kronrod, Faradzev) 1.Cover the DP table with t-blocks 2.Initialize values F(.,.) in first row & column 3.Row-by-row, use offset values at leftmost column and top row of each block, to find offset values at rightmost column and bottom row 4.Let Q = total of offsets at row N F(N, N) = Q + F(N, 0)

The Four-Russian Algorithm t t t

Indexing-based Local Aligners BLAST, WU-BLAST, BlastZ, MegaBLAST, BLAT, PatternHunter, ……

Some useful applications of alignments Given a newly discovered gene,  Does it occur in other species?  How fast does it evolve? Assume we try Smith-Waterman: The entire genomic database Our new gene

Some useful applications of alignments Given a newly sequenced organism, Which subregions align with other organisms?  Potential genes  Other biological characteristics Assume we try Smith-Waterman: The entire genomic database Our newly sequenced mammal 3 

Next lecture Methods that find local alignments in Time O(much-faster-than-quadratic)