Pairwise Sequence Alignment

Slides:



Advertisements
Similar presentations
Sequence Alignments with Indels Evolution produces insertions and deletions (indels) – In addition to substitutions Good example: MHHNALQRRTVWVNAY MHHALQRRTVWVNAY-
Advertisements

Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy Sep 10 th, 2013 BMI/CS 576.
Alignment methods Introduction to global and local sequence alignment methods Global : Needleman-Wunch Local : Smith-Waterman Database Search BLAST FASTA.
Sequence allignement 1 Chitta Baral. Sequences and Sequence allignment Two main kind of sequences –Sequence of base pairs in DNA molecules (A+T+C+G)*
Sequence Alignment.
C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Alignments 1 Sequence Analysis.
Introduction to Bioinformatics Burkhard Morgenstern Institute of Microbiology and Genetics Department of Bioinformatics Goldschmidtstr. 1 Göttingen, March.
S. Maarschalkerweerd & A. Tjhang1 Probability Theory and Basic Alignment of String Sequences Chapter
Sequence Alignment Algorithms in Computational Biology Spring 2006 Edited by Itai Sharon Most slides have been created and edited by Nir Friedman, Dan.
1-month Practical Course Genome Analysis (Integrative Bioinformatics & Genomics) Lecture 3: Pair-wise alignment Centre for Integrative Bioinformatics VU.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez June 23, 2005.
Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation.
Sequence Alignment Variations Computing alignments using only O(m) space rather than O(mn) space. Computing alignments with bounded difference Exclusion.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez June 23, 2004.
Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.
Pairwise Alignment Global & local alignment Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis.
Sequence Alignment II CIS 667 Spring Optimal Alignments So we know how to compute the similarity between two sequences  How do we construct an.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 20, 2003.
Sequence Alignment III CIS 667 February 10, 2004.
Computational Biology, Part 2 Sequence Comparison with Dot Matrices Robert F. Murphy Copyright  1996, All rights reserved.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Sequence similarity. Motivation Same gene, or similar gene Suffix of A similar to prefix of B? Suffix of A similar to prefix of B..Z? Longest similar.
Dynamic Programming. Pairwise Alignment Needleman - Wunsch Global Alignment Smith - Waterman Local Alignment.
Protein Sequence Comparison Patrice Koehl
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 10, 2005.
Pairwise alignment Computational Genomics and Proteomics.
Class 2: Basic Sequence Alignment
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
Sequence comparison: Local alignment
TM Biological Sequence Comparison / Database Homology Searching Aoife McLysaght Summer Intern, Compaq Computer Corporation Ballybrit Business Park, Galway,
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Multiple Sequence Alignment CSC391/691 Bioinformatics Spring 2004 Fetrow/Burg/Miller (Slides by J. Burg)
Developing Pairwise Sequence Alignment Algorithms
Sequence Analysis Determining how similar 2 (or more) gene/protein sequences are (too each other) is a “staple” function in bioinformatics. This information.
Bioiformatics I Fall Dynamic programming algorithm: pairwise comparisons.
Pair-wise Sequence Alignment What happened to the sequences of similar genes? random mutation deletion, insertion Seq. 1: 515 EVIRMQDNNPFSFQSDVYSYG EVI.
Pairwise alignments Introduction Introduction Why do alignments? Why do alignments? Definitions Definitions Scoring alignments Scoring alignments Alignment.
Pairwise & Multiple sequence alignments
CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)
Dynamic Programming. Well known algorithm design techniques:. –Divide-and-conquer algorithms Another strategy for designing algorithms is dynamic programming.
ADA: 7. Dynamic Prog.1 Objective o introduce DP, its two hallmarks, and two major programming techniques o look at two examples: the fibonacci.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Pairwise Sequence Alignment BMI/CS 776 Mark Craven January 2002.
Parsimony-Based Approaches to Inferring Phylogenetic Trees BMI/CS 576 Colin Dewey Fall 2010.
Lecture 6. Pairwise Local Alignment and Database Search Csc 487/687 Computing for bioinformatics.
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
Chapter 3 Computational Molecular Biology Michael Smith
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
8/31/07BCB 444/544 F07 ISU Dobbs #6 - More DP: Global vs Local Alignment1 BCB 444/544 Lecture 6 Try to Finish Dynamic Programming Global & Local Alignment.
8/31/07BCB 444/544 F07 ISU Dobbs #6 - Scoring Matrices & Alignment Stats1 BCB 444/544 Lecture 6 Finish Dynamic Programming Scoring Matrices Alignment Statistics.
Applied Bioinformatics Week 3. Theory I Similarity Dot plot.
Parsimony-Based Approaches to Inferring Phylogenetic Trees BMI/CS 576 Colin Dewey Fall 2015.
Sequence Alignments with Indels Evolution produces insertions and deletions (indels) – In addition to substitutions Good example: MHHNALQRRTVWVNAY MHHALQRRTVWVNAY-
Space Efficient Alignment Algorithms and Affine Gap Penalties Dr. Nancy Warter-Perez.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Sequence Alignment.
1 Выравнивание двух последовательностей. 2 AGC A A A C
The statistics of pairwise alignment BMI/CS 576 Colin Dewey Fall 2015.
Local Alignment Vasileios Hatzivassiloglou University of Texas at Dallas.
Protein Structure Prediction: Threading and Rosetta BMI/CS 576 Colin Dewey Fall 2008.
Introduction to Sequence Alignment. Why Align Sequences? Find homology within the same species Find clues to gene function Practical issues in experiments.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Sequence comparison: Local alignment
Pairwise Sequence Alignment
Biology 162 Computational Genetics Todd Vision Fall Aug 2004
Pairwise sequence Alignment.
#7 Still more DP, Scoring Matrices
Intro to Alignment Algorithms: Global and Local
BCB 444/544 Lecture 7 #7_Sept5 Global vs Local Alignment
Presentation transcript:

Pairwise Sequence Alignment BMI/CS 576 www.biostat.wisc.edu/bmi576 Colin Dewey cdewey@biostat.wisc.edu Fall 2010

Overview What does it mean to align sequences? How do we cast sequence alignment as a computational problem? What algorithms exist for solving this computational problem?

What is sequence alignment? Pattern matching ...CATCGATGACTATCCG... ATGACTGT Database searching . CATGCATGCTGCGTAC CATGGTTGCTCACAAGTAC CATGCCTGCTGCGTAA TACGTGCCTGACCTGCGTAC CATGCCGAATGCTG CATGCTTGCTGGCGTAAA suffix trees, locality-sensitive hashing,... BLAST Optimization problem Needleman-Wunsch, Smith-Waterman,... Statistical problem Pair HMMs, TKF, Karlin-Altschul statistic...

DNA sequence edits Substitutions: ACGA AGGA Insertions: ACGA ACCGGAGA Deletions: ACGGAGA AGA Transpositions: ACGGAGA AAGCGGA Inversions: ACGGAGA ACTCCGA

Alignment scales For short DNA sequences (gene scale) we will generally only consider Substitutions: cause mismatches in alignments Insertions/Deletions: cause gaps in alignments For longer DNA sequences (genome scale) we will consider additional events Transposition Inversion In this course we will focus on the case of short sequences

What is a pairwise alignment? We will focus on evolutionary alignment matching of homologous positions in two sequences positions with no homologous pair are matched with a space ‘-’ A group of consecutive spaces is a gap CA--GATTCGAAT CGCCGATT---AT gap

Dot plots Not technically an “alignment” But gives picture of correspondence between pairs of sequences Dot represents similarity between segments of the two sequences

Gaps sequences may have diverged from a common ancestor through various types of mutations: substitutions (ACGA AGGA) insertions (ACGA ACCGGAGA) deletions (ACGGAGA AGA) the latter two will result in gaps in alignments

The Role of Homology character: some feature of an organism (could be molecular, structural, behavioral, etc.) homology: the relationship of two characters that have descended from a common ancestor homologous characters tend to be similar due to their common ancestry and evolutionary pressures thus we often infer homology from similarity thus we can sometimes infer structure/function from sequence similarity

Homology Example: Evolution of the Globins globins involved in transport of oxygen – ancient and diverse family Recall that hemoglobin is composed of four globin alpha polypeptides: 2 alpha and 2 beta Various versions of the alpha and beta subunits that are expressed at different times in development (e.g. embryonic, fetal, adult). Alphas and betas evolved from a common ancestor. A gene duplication event gave rise to precursor of alphas and precursor of betas. These evolved separately and then gene duplication events gave rise to various alpha and beta forms.

Homology homologous sequences can be divided into three groups orthologous sequences: sequences that diverged due to a speciation event (e.g. human a-globin and mouse a-globin) paralogous sequences: sequences that diverged due to a gene duplication event (e.g. human a-globin and human b-globin, various versions of both ) xenologous sequences: sequences for which the history of one of them involves interspecies transfer since the time of their common ancestor

Insertions/Deletions and Protein Structure Why is it that two “similar” sequences may have large insertions/deletions? some insertions and deletions may not significantly affect the structure of a protein loop structures: insertions/deletions here not so significant

Example Alignment: Globins figure at right shows prototypical structure of globins figure below shows part of alignment for 8 globins (-’s indicate gaps)

Issues in Sequence Alignment the sequences we’re comparing probably differ in length there may be only a relatively small region in the sequences that matches we want to allow partial matches (i.e. some amino acid pairs are more substitutable than others) variable length regions may have been inserted/deleted from the common ancestral sequence

Two main classes of pairwise alignment Global: All positions are aligned Local: A (contiguous) subset of positions are aligned CA--GATTCGAAT CGCCGATT---AT semi-global: find best match without penalizing gaps on the ends of the alignment ..GATT..... ....GATT..

Pairwise Alignment: Task Definition Given a pair of sequences (DNA or protein) a method for scoring a candidate alignment Do find an alignment for which the score is maximized

Scoring An Alignment: What Is Needed? substitution matrix S(a,b) indicates score of aligning character a with character b gap penalty function w(k) indicates cost of a gap of length k

Linear Gap Penalty Function different gap penalty functions require somewhat different algorithms the simplest case is when a linear gap function is used where s is a constant we’ll start by considering this case

Scoring an Alignment the score of an alignment is the sum of the scores for pairs of aligned characters plus the scores for gaps example: given the following alignment VAHV---D--DMPNALSALSDLHAHKL AIQLQVTGVVVTDATLKNLGSVHVSKG we would score it by S(V,A) + S(A,I) + S(H,Q) + S(V,L) + 3s + S(D,G) + 2s …

The Space of Global Alignments some possible global alignments for ELV and VIS ELV VIS -ELV VIS- --ELV VIS-- ELV- -VIS E-LV VIS- ELV-- --VIS EL-V -VIS

Number of Possible Alignments given sequences of length m and n assume we don’t count as distinct and we can have as few as 0 and as many as min{m, n} aligned pairs therefore the number of possible alignments is given by C- -G -C G- Two strings of length=100 No gaps: 2N-1 possible alignments 199 in this example Gaps: 10**77 in this case See p. 188 in Waterman for an analysis of this issue. There must be k aligned pairs, 0 < k < min{n,m} There are n-choose-k ways to select the residues in n, and m-choose-k ways to select the residues in m So we have sum_{k>=0} (n-choose-k)(m-choosek) = (n+m choose k) possible alignments (for a local)

Number of Possible Alignments there are possible global alignments for 2 sequences of length n e.g. two sequences of length 100 have possible alignments but we can use dynamic programming to find an optimal alignment efficiently Stirling’s formula n! ~ sqrt{2\pin}(n/e)^n Two strings of length=100 No gaps: 2N-1 possible alignments 199 in this example Gaps: 10**77 in this case See p. 188 in Waterman for an analysis of this issue. There must be k aligned pairs, 0 < k < min{n,m} There are n-choose-k ways to select the residues in n, and m-choose-k ways to select the residues in m So we have sum_{k>=0} (n-choose-k)(m-choosek) = (n+m choose k) possible alignments (for a local)

Dynamic Programming Algorithmic technique for optimization problems that have two properties: Optimal substructure: Optimal solution can be computed from optimal solutions to subproblems Overlapping subproblems: Subproblems overlap such that the total number of distinct subproblems to be solved is relatively small

Dynamic Programming Break problem into overlapping subproblems use memoization: remember solutions to subproblems that we have already seen 3 5 7 1 8 2 4 6

Fibonacci example 1,1,2,3,5,8,13,21,... fib(n) = fib(n - 2) + fib(n - 1) Could implement as a simple recursive function However, complexity of simple recursive function is exponential in n x, x - 1, x - 2, x - 3, x - 4,... 1, 1, 2, 3, 5 Draw out function call stack

Fibonacci dynamic programming Two approaches Memoization: Store results from previous calls of function in a table (top down approach) Solve subproblems from smallest to largest, storing results in table (bottom up approach) Both require evaluating all (n-1) subproblems only once: O(n)

Dynamic Programming Graphs Dynamic programming algorithms can be represented by a directed acyclic graph Each subproblem is a vertex Direct dependencies between subproblems are edges 1 2 3 4 5 6 graph for fib(6)

Why “Dynamic Programming”? Coined by Richard Bellman in 1950 while working at RAND Government officials were overseeing RAND, disliked research and mathematics “programming”: planning, decision making (optimization) “dynamic”: multistage, time varying “It was something not even a Congressman could object to. So I used it as an umbrella for my activities”

Pairwise Alignment Via Dynamic Programming first algorithm by Needleman & Wunsch, Journal of Molecular Biology, 1970 dynamic programming algorithm: determine best alignment of two sequences by determining best alignment of all prefixes of the sequences

Dynamic Programming Idea consider last step in computing alignment of AAAC with AGC three possible options; in each we’ll choose a different pairing for end of alignment, and add this to the best alignment of previous characters AAA C AG AAAC C AG - Write out scenarios more abstractly Consider last column X_i aligned to Y_j X_i gapped Y_j gapped AAA - AGC C consider best alignment of these prefixes score of aligning this pair +

DP Algorithm for Global Alignment with Linear Gap Penalty Subproblem: F(i,j) = score of best alignment of the length i prefix of x and the length j prefix of y. Note base cases

Dynamic Programming Implementation given an n-character sequence x, and an m-character sequence y construct an (n+1)  (m+1) matrix F F ( i, j ) = score of the best alignment of x[1…i ] with y[1…j ] A C G score of best alignment of AAA to AG

Initializing Matrix: Global Alignment with Linear Gap Penalty s 2s C G 3s 4s

DP Algorithm Sketch: Global Alignment initialize first row and column of matrix fill in rest of matrix from top to bottom, left to right for each F ( i, j ), save pointer(s) to cell(s) that resulted in best score F (m, n) holds the optimal alignment score; trace pointers back from F (m, n) to F (0, 0) to recover alignment

Global Alignment Example suppose we choose the following scoring scheme: +1 -1 s (penalty for aligning with a space) = -2

Global Alignment Example -2 -4 C G -6 -8 1 -1 -3 -5 one optimal alignment x: A A A C y: A G - C but there are three optimal alignments here (can you find them?)

DP Comments works for either DNA or protein sequences, although the substitution matrices used differ finds an optimal alignment the exact algorithm (and computational complexity) depends on gap penalty function (we’ll come back to this issue)

Equally Optimal Alignments many optimal alignments may exist for a given pair of sequences can use preference ordering over paths when doing traceback highroad 1 lowroad 3 2 2 3 1 highroad and lowroad alignments show the two most different optimal alignments

Highroad & Lowroad Alignments C x: y: A - G C highroad alignment -4 -6 -2 A -2 1 -1 -3 A -4 -1 -2 lowroad alignment A x: A A A C -6 -3 -2 -1 y: - A G C C -8 -5 -4 -1

Dynamic Programming Analysis recall, there are possible global alignments for 2 sequences of length n but the DP approach finds an optimal alignment efficiently Two strings of length=100 No gaps: 2N-1 possible alignments 199 in this example Gaps: 10**77 in this case See p. 188 in Waterman for an analysis of this issue.

Computational Complexity initialization: O(m), O(n) where sequence lengths are m, n filling in rest of matrix: O(mn) traceback: O(m + n) hence, if sequences have nearly same length, the computational complexity is

Local Alignment so far we have discussed global alignment, where we are looking for best match between sequences from one end to the other often, we will only want a local alignment, the best match between contiguous subsequences (substrings) of x and y

Local Alignment Motivation useful for comparing protein sequences that share a common motif (conserved pattern) or domain (independently folded unit) but differ elsewhere useful for comparing DNA sequences that share a similar motif but differ elsewhere useful for comparing protein sequences against genomic DNA sequences (long stretches of uncharacterized sequence) more sensitive when comparing highly diverged sequences

Local Alignment DP Algorithm original formulation: Smith & Waterman, Journal of Molecular Biology, 1981 interpretation of array values is somewhat different F (i, j) = score of the best alignment of a suffix of x[1…i ] and a suffix of y[1…j ]

Local Alignment DP Algorithm the recurrence relation is slightly different than for global algorithm Explain why zero: we can always choose a suffix of length zero from both strings

Local Alignment DP Algorithm initialization: first row and first column initialized with 0’s traceback: find maximum value of F(i, j); can be anywhere in matrix stop when we get to a cell with value 0

Local Alignment Example 1 2 3 x: y: G A Match: +1 Mismatch: -1 Space: -2

More On Gap Penalty Functions a gap of length k is more probable than k gaps of length 1 a gap may be due to a single mutational event that inserted/deleted a stretch of characters separated gaps are probably due to distinct mutational events a linear gap penalty function treats these cases the same it is more common to use gap penalty functions involving two terms a penalty g associated with opening a gap a smaller penalty s for extending the gap

Gap Penalty Functions linear affine

Dynamic Programming for the Affine Gap Penalty Case to do in time, need 3 matrices instead of 1 best score given that x[i] is aligned to y[j] best score given that x[i] is aligned to a gap best score given that y[j] is aligned to a gap

Why Three Matrices Are Needed consider aligning the sequences WFP and FW using g = -4, s = -1 and the following values from the BLOSUM-62 substitution matrix: S(F, W) = 1 S(W, W) = 11 S(F, F) = 6 S(W, P) = -4 S(F, P) = -4 the matrix shows the highest-scoring partial alignment for each pair of prefixes W F P -5 -6 -7 1 -4 6 2 No longer optimal substructure with previous subproblem formulization -WFP FW-- optimal alignment best alignment of these prefixes; to get optimal alignment, need to also remember WF FW -WF FW-

Global Alignment DP for the Affine Gap Penalty Case

Global Alignment DP for the Affine Gap Penalty Case initialization traceback start at largest of stop at any of note that pointers may traverse all three matrices

Global Alignment Example (Affine Gap Penalty) g = -3, s = -1 A C A C T M : 0 -∞ -∞ -∞ -∞ -∞ Ix : -3 -∞ -∞ -∞ -∞ -∞ Iy : -3 -4 -5 -6 -7 -8 A -∞ 1 -5 -4 -7 -8 -4 -∞ -∞ -∞ -∞ -∞ -∞ -∞ -3 -4 -5 -6 A -∞ -3 -2 -5 -6 -5 -3 -9 -8 -11 -12 -∞ -∞ -7 -4 -5 -6 T -∞ -6 -4 -1 -3 -4 -6 -4 -4 -6 -9 -10 -∞ -∞ -10 -8 -5 -6

Global Alignment Example (Continued) -∞ -∞ -∞ -∞ -∞ Ix : -3 -∞ -∞ -∞ -∞ -∞ Iy : -3 -4 -5 -6 -7 -8 A -∞ 1 -5 -4 -7 -8 -4 -∞ -∞ -∞ -∞ -∞ -∞ -∞ -3 -4 -5 -6 A -∞ -3 -2 -5 -6 -5 -3 -9 -8 -11 -12 -∞ -∞ -7 -4 -5 -6 T -∞ -6 -4 -1 -3 -4 -6 -4 -4 -6 -9 -10 -∞ -∞ -10 -8 -5 -6 ACACT AA--T ACACT A--AT ACACT --AAT three optimal alignments:

Local Alignment DP for the Affine Gap Penalty Case

Local Alignment DP for the Affine Gap Penalty Case initialization traceback start at largest stop at In his notes, Thomas had start at largest of M(I,j), I_x(I,j), I_y(I,j). This doesn’t seem right. It suggest we might get a higher score by ending an alignment with a gap than by not including that gap.

Gap Penalty Functions linear: affine: concave: a function for which the following holds for all k, l, m ≥ 0 e.g.

Concave Gap Penalty Functions 1 2 3 4 5 6 7 8 9 10

Computational Complexity and Gap Penalty Functions linear: affine: concave Concave result due to Eppstein general:  assuming two sequences of length n

Alignment (Global) with General Gap Penalty Function consider every previous element in the row consider every previous element in the column