Sequence Alignment Kun-Mao Chao (趙坤茂)

Slides:



Advertisements
Similar presentations
Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy Sep 10 th, 2013 BMI/CS 576.
Advertisements

Longest Common Subsequence
Eugene W.Myers and Webb Miller. Outline Introduction Gotoh's algorithm O(N) space Gotoh's algorithm Main algorithm Implementation Conclusion.
Sequence Alignment Tutorial #2
Heaviest Segments in a Number Sequence Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National Taiwan University, Taiwan.
Introduction to Bioinformatics Burkhard Morgenstern Institute of Microbiology and Genetics Department of Bioinformatics Goldschmidtstr. 1 Göttingen, March.
Sequence Alignment Tutorial #2
6/11/2015 © Bud Mishra, 2001 L7-1 Lecture #7: Local Alignment Computational Biology Lecture #7: Local Alignment Bud Mishra Professor of Computer Science.
Global alignment algorithm CS 6890 Zheng Lu. Introduction Global alignments find the best match over the total length of both sequences. We do global.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez June 23, 2005.
Introduction to Bioinformatics Algorithms Sequence Alignment.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez June 23, 2004.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 20, 2003.
Introduction to Bioinformatics Algorithms Sequence Alignment.
Alignment II Dynamic Programming
Sequence similarity. Motivation Same gene, or similar gene Suffix of A similar to prefix of B? Suffix of A similar to prefix of B..Z? Longest similar.
Dynamic Programming. Pairwise Alignment Needleman - Wunsch Global Alignment Smith - Waterman Local Alignment.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 10, 2005.
FA05CSE182 CSE 182-L2:Blast & variants I Dynamic Programming
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
Developing Pairwise Sequence Alignment Algorithms
Sequence Alignment.
Pairwise alignments Introduction Introduction Why do alignments? Why do alignments? Definitions Definitions Scoring alignments Scoring alignments Alignment.
Dynamic-Programming Strategies for Analyzing Biomolecular Sequences.
Pairwise Sequence Alignment BMI/CS 776 Mark Craven January 2002.
Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.
Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.
Sequence Alignment Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National Taiwan University, Taiwan
Expected accuracy sequence alignment Usman Roshan.
A Table-Driven, Full-Sensitivity Similarity Search Algorithm Gene Myers and Richard Durbin Presented by Wang, Jia-Nan and Huang, Yu- Feng.
Heaviest Segments in a Number Sequence Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National Taiwan University, Taiwan.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Dynamic Programming Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National Taiwan University, Taiwan
Core String Edits, Alignments, and Dynamic Programming.
Alignments and Phylogenetic tree Reading: Introduction to Bioinformatics. Arthur M. Lesk. Fourth Edition Chapter 5.
Sequence Alignment Kun-Mao Chao (趙坤茂)
Sequence Alignment.
LSM3241: Bioinformatics and Biocomputing Lecture 4: Sequence analysis methods revisited Prof. Chen Yu Zong Tel:
Dynamic-Programming Strategies for Analyzing Biomolecular Sequences
Homology Search Tools Kun-Mao Chao (趙坤茂)
Sequence Alignment Using Dynamic Programming
Sequence Alignment 11/24/2018.
SMA5422: Special Topics in Biotechnology
Heaviest Segments in a Number Sequence
Intro to Alignment Algorithms: Global and Local
Sequence Alignment Kun-Mao Chao (趙坤茂)
A Quick Note on Useful Algorithmic Strategies
Dynamic Programming 1/15/2019 8:22 PM Dynamic Programming.
Dynamic Programming.
On the Range Maximum-Sum Segment Query Problem
A Note on Useful Algorithmic Strategies
A Note on Useful Algorithmic Strategies
A Note on Useful Algorithmic Strategies
Sequence Alignment Kun-Mao Chao (趙坤茂)
A Note on Useful Algorithmic Strategies
Dynamic Programming.
Space-Saving Strategies for Computing Δ-points
Multiple Sequence Alignment
Space-Saving Strategies for Computing Δ-points
Space-Saving Strategies for Computing Δ-points
Space-Saving Strategies for Analyzing Biomolecular Sequences
Sequence Alignment (I)
Space-Saving Strategies for Computing Δ-points
A Note on Useful Algorithmic Strategies
A Note on Useful Algorithmic Strategies
Multiple Sequence Alignment
Space-Saving Strategies for Computing Δ-points
Space-Saving Strategies for Computing Δ-points
Dynamic Programming Kun-Mao Chao (趙坤茂)
Presentation transcript:

Sequence Alignment Kun-Mao Chao (趙坤茂) Department of Computer Science and Information Engineering National Taiwan University, Taiwan WWW: http://www.csie.ntu.edu.tw/~kmchao

GenBank 200.0

GenBank 215.0

GenBank 220.0

orz’s sequence evolution orz (kid) OTZ (adult) Orz (big head) Crz (motorcycle driver) on_ (soldier) or2 (bottom up) oΩ (back high) STO (the other way around) Oroz (me) the origin? their evolutionary relationships? their putative functional relationships?

What? The truth is more important than the facts. THETR UTHIS MOREI

Dot Matrix

Pairwise Alignment Sequence A: CTTAACT Sequence B: CGGATCAT An alignment of A and B: C---TTAACT CGGATCA--T Sequence A Sequence B

Pairwise Alignment Sequence A: CTTAACT Sequence B: CGGATCAT An alignment of A and B: Mismatch Match C---TTAACT CGGATCA--T Deletion gap Insertion gap

Alignment Graph C---TTAACT CGGATCA--T Sequence A: CTTAACT Sequence B: CGGATCAT C G G A T C A T C T T A A C T C---TTAACT CGGATCA--T

A simple scoring scheme Match: +8 (w(x, y) = 8, if x = y) Mismatch: -5 (w(x, y) = -5, if x ≠ y) Each gap symbol: -3 (w(-,x)=w(x,-)=-3) C - - - T T A A C T C G G A T C A - - T +8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12 Alignment score

An optimal alignment -- the alignment of maximum score Let A=a1a2…am and B=b1b2…bn . Si,j: the score of an optimal alignment between a1a2…ai and b1b2…bj With proper initializations, Si,j can be computed as follows.

Computing Si,j j w(ai,bj) w(ai,-) i w(-,bj) Sm,n

Initializations C G G A T C A T -3 -6 -9 -12 -15 -18 -21 -24 Match: 8 Mismatch: -5 Gap symbol: -3 C G G A T C A T -3 -6 -9 -12 -15 -18 -21 -24 C T T A A C T

S3,5 = ? C G G A T C A T -3 -6 -9 -12 -15 -18 -21 -24 8 5 2 -1 -4 -7 Match: 8 Mismatch: -5 Gap symbol: -3 C G G A T C A T -3 -6 -9 -12 -15 -18 -21 -24 8 5 2 -1 -4 -7 -10 -13 3 7 4 1 -2 -5 ? C T T A A C T

S3,5 = ? C G G A T C A T -3 -6 -9 -12 -15 -18 -21 -24 8 5 2 -1 -4 -7 Match: 8 Mismatch: -5 Gap symbol: -3 C G G A T C A T -3 -6 -9 -12 -15 -18 -21 -24 8 5 2 -1 -4 -7 -10 -13 3 7 4 1 -2 -5 7-3=4 -3+8=5 -5-3=-8 C T T A A C T

S3,5 = 5 C G G A T C A T -3 -6 -9 -12 -15 -18 -21 -24 8 5 2 -1 -4 -7 Match: 8 Mismatch: -5 Gap symbol: -3 C G G A T C A T -3 -6 -9 -12 -15 -18 -21 -24 8 5 2 -1 -4 -7 -10 -13 3 7 4 1 -2 -5 9 6 10 -8 -11 -14 14 C T T A A C T optimal score

C T T A A C – T C G G A T C A T 8 – 5 –5 +8 -5 +8 -3 +8 = 14 8 – 5 –5 +8 -5 +8 -3 +8 = 14 C G G A T C A T -3 -6 -9 -12 -15 -18 -21 -24 8 5 2 -1 -4 -7 -10 -13 3 7 4 1 -2 -5 9 6 10 -8 -11 -14 14 C T T A A C T

Now try this example in class Sequence A: CAATTGA Sequence B: GAATCTGC Their optimal alignment?

Initializations G A A T C T G C -3 -6 -9 -12 -15 -18 -21 -24 Match: 8 Mismatch: -5 Gap symbol: -3 G A A T C T G C -3 -6 -9 -12 -15 -18 -21 -24 C AA T T G A

S4,2 = ? G A A T C T G C -3 -6 -9 -12 -15 -18 -21 -24 -5 -8 -11 -14 -4 Match: 8 Mismatch: -5 Gap symbol: -3 G A A T C T G C -3 -6 -9 -12 -15 -18 -21 -24 -5 -8 -11 -14 -4 -7 -10 -13 3 11 8 5 2 -1 ? C AA T T G A

S4,2 = ? G A A T C T G C -3 -6 -9 -12 -15 -18 -21 -24 -5 -8 -11 -14 -4 Match: 8 Mismatch: -5 Gap symbol: -3 G A A T C T G C -3 -6 -9 -12 -15 -18 -21 -24 -5 -8 -11 -14 -4 -7 -10 -13 3 11 8 5 2 -1 0-3=-3 -11-5=-16 -14-3=-17 C AA T T G A

S5,5 = ? G A A T C T G C -3 -6 -9 -12 -15 -18 -21 -24 -5 -8 -11 -14 -4 Match: 8 Mismatch: -5 Gap symbol: -3 G A A T C T G C -3 -6 -9 -12 -15 -18 -21 -24 -5 -8 -11 -14 -4 -7 -10 -13 3 11 8 5 2 -1 19 16 13 10 7 -17 ? C AA T T G A

S5,5 = ? G A A T C T G C -3 -6 -9 -12 -15 -18 -21 -24 -5 -8 -11 -14 -4 Match: 8 Mismatch: -5 Gap symbol: -3 G A A T C T G C -3 -6 -9 -12 -15 -18 -21 -24 -5 -8 -11 -14 -4 -7 -10 -13 3 11 8 5 2 -1 19 16 13 10 7 -17 16-3=13 19-5=14 C AA T T G A

S5,5 = 14 G A A T C T G C -3 -6 -9 -12 -15 -18 -21 -24 -5 -8 -11 -14 Match: 8 Mismatch: -5 Gap symbol: -3 G A A T C T G C -3 -6 -9 -12 -15 -18 -21 -24 -5 -8 -11 -14 -4 -7 -10 -13 3 11 8 5 2 -1 19 16 13 10 7 -17 14 24 21 18 32 29 1 27 C AA T T G A optimal score

C A A T - T G A G A A T C T G C -5 +8 +8 +8 -3 +8 +8 -5 = 27 -5 +8 +8 +8 -3 +8 +8 -5 = 27 G A A T C T G C -3 -6 -9 -12 -15 -18 -21 -24 -5 -8 -11 -14 -4 -7 -10 -13 3 11 8 5 2 -1 19 16 13 10 7 -17 14 24 21 18 32 29 1 27 C AA T T G A

Global Alignment vs. Local Alignment

Maximum-sum interval Given a sequence of real numbers a1a2…an , find a consecutive subsequence with the maximum sum. 9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 -9 For each position, we can compute the maximum-sum interval ending at that position in O(n) time. Therefore, a naive algorithm runs in O(n2) time.

Computing a segment sum in O(1) time? Input: a sequence of real numbers a1a2…an Query: the sum of ai ai+1…aj

Computing a segment sum in O(1) time prefix-sum(i) = a1+a2+…+ai all n prefix sums are computable in O(n) time. sum(i, j) = prefix-sum(j) – prefix-sum(i-1) j i prefix-sum(j) prefix-sum(i-1)

Maximizing sum(i, j) sum(i, j) = prefix-sum(j) – prefix-sum(i-1) O(n)-time Method 1 sum(i, j) = prefix-sum(j) – prefix-sum(i-1) For each location j, prefix-sum(j) is fixed. To compute the maximum-sum interval ending at position j can be done by finding the minimum prefix-sum before position j. j i prefix-sum(j) prefix-sum(i-1)

Maximum-sum interval (The recurrence relation) Define S(i) to be the maximum sum of the intervals ending at position i. O(n)-time Method 2 ai If S(i-1) < 0, concatenating ai with its previous interval gives less sum than ai itself.

Maximum-sum interval (Tabular computation) 9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 -9 S(i) 9 6 7 14 –1 2 5 1 3 –4 6 4 12 16 7 The maximum sum

Maximum-sum interval (Traceback) 9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 -9 S(i) 9 6 7 14 –1 2 5 1 3 –4 6 4 12 16 7 The maximum-sum interval: 6 -2 8 4

An optimal local alignment Si,j: the score of an optimal local alignment ending at (i, j) between a1a2…ai and b1b2…bj. With proper initializations, Si,j can be computed as follows.

local alignment C G G A T C A T 8 5 2 3 13 11 ? C T T A A C T Match: 8 Mismatch: -5 Gap symbol: -3 C G G A T C A T 8 5 2 3 13 11 ? C T T A A C T

local alignment C G G A T C A T 8 5 2 3 13 11 C T T A A C T Match: 8 Mismatch: -5 Gap symbol: -3 C G G A T C A T 8 5 2 3 13 11 2-3=-1 5+8=13 3-3=0 C T T A A C T

local alignment C G G A T C A T 8 5 2 3 13 11 10 7 18 C T T A A C T Match: 8 Mismatch: -5 Gap symbol: -3 C G G A T C A T 8 5 2 3 13 11 10 7 18 C T T A A C T The best score

A – C - T A T C A T 8-3+8-3+8 = 18 C G G A T C A T 8 5 2 3 13 11 10 7 8 5 2 3 13 11 10 7 18 C T T A A C T The best score

Now try this example in class Sequence A: CAATTGA Sequence B: GAATCTGC Their optimal local alignment?

Did you get it right? G A A T C T G C 8 5 2 3 16 13 10 7 4 24 21 18 15 8 5 2 3 16 13 10 7 4 24 21 18 15 12 19 29 26 23 37 34 32 C AA T T G A

A A T – T G A A T C T G 8+8+8-3+8+8 = 37 G A A T C T G C 8 5 2 3 16 13 10 7 4 24 21 18 15 12 19 29 26 23 37 34 32 C AA T T G A

Osamu Gotoh

Affine gap penalties C - - - T T A A C T C G G A T C A - - T Match: +8 (w(a, b) = 8, if a = b) Mismatch: -5 (w(a, b) = -5, if a ≠ b) Each gap symbol: -3 (w(-,b) = w(a,-) = -3) Each gap is charged an extra gap-open penalty: -4. -4 -4 C - - - T T A A C T C G G A T C A - - T +8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12 Alignment score: 12 – 4 – 4 = 4

Affine gap panalties A gap of length k is penalized x + k·y. gap-open penalty Three cases for alignment endings: ...x ...x ...x ...- ...- ...x gap-symbol penalty an aligned pair This is the same as the scoring scheme that penalizes the first symbol x + y and an extended symbol y. a deletion an insertion

Affine gap penalties Let D(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with a deletion. Let I(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with an insertion. Let S(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj.

Affine gap penalties (A gap of length k is penalized x + k·y.)

Affine gap penalties S I D S I D -y w(ai,bj) -x-y S I D D -x-y I S -y

Constant gap penalties Match: +8 (w(a, b) = 8, if a = b) Mismatch: -5 (w(a, b) = -5, if a ≠ b) Each gap symbol: 0 (w(-,b) = w(a,-) = 0) Each gap is charged a constant penalty: -4. -4 -4 C - - - T T A A C T C G G A T C A - - T +8 0 0 0 +8 -5 +8 0 0 +8 = +27 Alignment score: 27 – 4 – 4 = 19

Constant gap penalties Let D(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with a deletion. Let I(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with an insertion. Let S(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj.

Constant gap penalties

Restricted affine gap panalties A gap of length k is penalized x + f(k)·y. where f(k) = k for k <= c and f(k) = c for k > c Five cases for alignment endings: ...x ...x ...x ...- ...- ...x and 5. for long gaps an aligned pair a deletion an insertion

Restricted affine gap penalties

D(i, j) vs. D’(i, j) Case 1: the best alignment ending at (i, j) with a deletion at the end has the last deletion gap of length <= c D(i, j) >= D’(i, j) Case 2: the best alignment ending at (i, j) with a deletion at the end has the last deletion gap of length >= c D(i, j) <= D’(i, j)

Max{S(i,j)-x-ky, S(i,j)-x-cy}