Pairwise Sequence Alignment BMI/CS 776 Mark Craven January 2002.

Slides:



Advertisements
Similar presentations
Sequence Alignment I Lecture #2
Advertisements

Sequence Alignments with Indels Evolution produces insertions and deletions (indels) – In addition to substitutions Good example: MHHNALQRRTVWVNAY MHHALQRRTVWVNAY-
Global Sequence Alignment by Dynamic Programming.
Pairwise Sequence Alignment Sushmita Roy BMI/CS 576 Sushmita Roy Sep 10 th, 2013 BMI/CS 576.
Alignment methods Introduction to global and local sequence alignment methods Global : Needleman-Wunch Local : Smith-Waterman Database Search BLAST FASTA.
Pairwise Sequence Alignment
Sequence allignement 1 Chitta Baral. Sequences and Sequence allignment Two main kind of sequences –Sequence of base pairs in DNA molecules (A+T+C+G)*
Sequence Alignment.
Lecture 8 Alignment of pairs of sequence Local and global alignment
Definitions Optimal alignment - one that exhibits the most correspondences. It is the alignment with the highest score. May or may not be biologically.
C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Alignments 1 Sequence Analysis.
 If Score(i, j) denotes best score to aligning A[1 : i] and B[1 : j] Score(i-1, j) + galign A[i] with GAP Score(i, j-1) + galign B[j] with GAP Score(i,
Introduction to Bioinformatics Burkhard Morgenstern Institute of Microbiology and Genetics Department of Bioinformatics Goldschmidtstr. 1 Göttingen, March.
Sequence Alignment Algorithms in Computational Biology Spring 2006 Edited by Itai Sharon Most slides have been created and edited by Nir Friedman, Dan.
1-month Practical Course Genome Analysis (Integrative Bioinformatics & Genomics) Lecture 3: Pair-wise alignment Centre for Integrative Bioinformatics VU.
Sequencing and Sequence Alignment
Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation.
Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.
Pairwise Alignment Global & local alignment Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis.
Sequence Alignment II CIS 667 Spring Optimal Alignments So we know how to compute the similarity between two sequences  How do we construct an.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 20, 2003.
Computational Biology, Part 2 Sequence Comparison with Dot Matrices Robert F. Murphy Copyright  1996, All rights reserved.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Alignment II Dynamic Programming
Sequence similarity. Motivation Same gene, or similar gene Suffix of A similar to prefix of B? Suffix of A similar to prefix of B..Z? Longest similar.
Dynamic Programming. Pairwise Alignment Needleman - Wunsch Global Alignment Smith - Waterman Local Alignment.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 10, 2005.
Pairwise alignment Computational Genomics and Proteomics.
Class 2: Basic Sequence Alignment
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
Sequence comparison: Score matrices Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
Sequence comparison: Local alignment
1 Introduction to Bioinformatics 2 Introduction to Bioinformatics. LECTURE 3: SEQUENCE ALIGNMENT * Chapter 3: All in the family.
TM Biological Sequence Comparison / Database Homology Searching Aoife McLysaght Summer Intern, Compaq Computer Corporation Ballybrit Business Park, Galway,
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Developing Pairwise Sequence Alignment Algorithms
Bioiformatics I Fall Dynamic programming algorithm: pairwise comparisons.
Pair-wise Sequence Alignment What happened to the sequences of similar genes? random mutation deletion, insertion Seq. 1: 515 EVIRMQDNNPFSFQSDVYSYG EVI.
Traceback and local alignment Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington.
Pairwise alignments Introduction Introduction Why do alignments? Why do alignments? Definitions Definitions Scoring alignments Scoring alignments Alignment.
Pairwise & Multiple sequence alignments
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Lecture 6. Pairwise Local Alignment and Database Search Csc 487/687 Computing for bioinformatics.
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
Chapter 3 Computational Molecular Biology Michael Smith
8/31/07BCB 444/544 F07 ISU Dobbs #6 - More DP: Global vs Local Alignment1 BCB 444/544 Lecture 6 Try to Finish Dynamic Programming Global & Local Alignment.
8/31/07BCB 444/544 F07 ISU Dobbs #6 - Scoring Matrices & Alignment Stats1 BCB 444/544 Lecture 6 Finish Dynamic Programming Scoring Matrices Alignment Statistics.
Applied Bioinformatics Week 3. Theory I Similarity Dot plot.
Biocomputation: Comparative Genomics Tanya Talkar Lolly Kruse Colleen O’Rourke.
Sequence Alignments with Indels Evolution produces insertions and deletions (indels) – In addition to substitutions Good example: MHHNALQRRTVWVNAY MHHALQRRTVWVNAY-
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Sequence Alignment.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
Local Alignment Vasileios Hatzivassiloglou University of Texas at Dallas.
Protein Structure Prediction: Threading and Rosetta BMI/CS 576 Colin Dewey Fall 2008.
Introduction to Sequence Alignment. Why Align Sequences? Find homology within the same species Find clues to gene function Practical issues in experiments.
More on HMMs and Multiple Sequence Alignment BMI/CS 776 Mark Craven March 2002.
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Sequence Alignment We assume a link between the linear information stored in DNA, RNA or amino-acid sequence and the protein function determined by its.
Sequence comparison: Local alignment
Biology 162 Computational Genetics Todd Vision Fall Aug 2004
Sequence Alignment 11/24/2018.
Pairwise sequence Alignment.
#7 Still more DP, Scoring Matrices
Intro to Alignment Algorithms: Global and Local
Pairwise Sequence Alignment
BCB 444/544 Lecture 7 #7_Sept5 Global vs Local Alignment
Pairwise Alignment Global & local alignment
Presentation transcript:

Pairwise Sequence Alignment BMI/CS Mark Craven January 2002

Pairwise Alignment: Task Definition Given –a pair of sequences (DNA or protein) –a method for scoring the similarity of a pair of characters Do –determine the correspondences between substrings in the sequences such that the similarity score is maximized

Motivation comparing sequences to gain information about the structure/function of a query sequence putting together a set of sequenced fragments (fragment assembly) comparing a segment sequenced by two different labs

The Role of Homology homology: similarity due to descent from a common ancestor often we can infer homology from similarity thus we can sometimes infer structure/function from sequence similarity

Homology homologous sequences can be divided into two groups –orthologous sequences: sequences that differ because they are found in different species (e.g. human  -globin and mouse  -globin) –paralogous sequences: sequences that differ because of a gene duplication event (e.g. human  -globin and human  -globin, various versions of both )

Issues in Sequence Alignment the sequences we’re comparing probably differ in length there may be only a relatively small region in the sequences that matches we want to allow partial matches (i.e. some amino acid pairs are more substitutable than others) variable length regions may have been inserted/deleted from the common ancestral sequence

Gaps sequences may have diverged from a common ancestor through various types of mutations: –substitutions (ACGA AGGA) –insertions (ACGA ACCGA) –deletions (ACGA AGA) the latter two will result in gaps in alignments

Insertions/Deletions and Protein Structure loop structures: insertions/deletions here not so significant

Example Alignment  GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL H+ KV + +A ++ +L+ L+++H+ K NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG gaps depicted with – middle line shows matches –identical matches shown with letters –similar amino acids shown with + –dissimilar amino acids/gaps indicated by space

Alignments in the Olden Days: Dot Plots

Types of Alignment global: find best match of both sequences in their entirety local: find best subsequence match semi-global: find best match without penalizing gaps on the ends of the alignment

Pairwise Alignment Via Dynamic Programming Needleman & Wunsch, Journal of Molecular Biology, 1970 dynamic programming: solve an instance of a problem by taking advantage of computed solutions for smaller subparts of the problem determine alignment of two sequences by determining alignment of all prefixes of the sequences

Scoring Scheme Components substitution matrix –s(a,b) indicates score of aligning character a with character b gap penalty function –w(k) indicates cost of a gap of length k

Linear Gap Penalty Function different gap penalty functions require somewhat different DP algorithms the simplest case is when a linear gap function is used where g is a constant we’ll start by considering this case

Dynamic Programming Idea consider last step in computing alignment of AAAC with AGC three possible options; in each we’ll choose a different pairing for end of alignment, and add this to best alignment of previous characters AAA CAG CAAAC CAG -AAA -AGC C consider best alignment of these prefixes score of aligning this pair +

Dynamic Programming Idea given an n-character sequence x, and an m- character sequence y construct an (n+1) x (m+1) matrix F F [ i, j ] = score of the best alignment of x[1…i ] with y[1…j ]

Announcements next lecture: BLAST & PSI-BLAST read Altshul et al., Nucleic Acids Research, 1997 interested in an AI reading group for grad students? see

Dynamic Programming Idea F[i-1, j-1] F[i, j-1] F[i-1, j] F[i, j] + g + s(x[i],y[j])

Dynamic Programming Idea in extending an alignment, we have 3 choices: –align x[ 1… i-1] with y[ 1… j-1] and match x[ i ] with y[ i ] –align x[1… i ] with y[ 1… j-1 ] and match a gap with y[ j ] –align x[ 1…i-1 ] with y[ 1… j ] and match a gap with x[ i ] choose highest scoring choice to fill in F [ i, j ]

DP Algorithm for Global Alignment with Linear Gap Penalty one way to specify the DP is in terms of its recurrence relation:

Initializing Matrix: Global Alignment with Linear Gap Penalty A g A 2g2g CAG A 3g3g C 4g4g 0 3g3gg2g2g

DP Algorithm Sketch initialize first row and column of matrix fill in rest of matrix from top to bottom, left to right for each F [ i, j ], save pointer(s) to cell(s) that resulted in best score F [m, n] holds the optimal alignment score; trace pointers back from F [m, n] to F [0, 0] to recover alignment

DP Algorithm Example suppose we choose the following scoring scheme: s(x[i], y[j]) = +1 when x[i] = y[j] -1 when x[i] <> y[j] g (penalty for aligning with a gap) = -2

DP Algorithm Example A -2 A -4 CAG A -6 C x: y: A - A G A A C C one optimal alignment

DP Comments works for either DNA or protein sequences, although the substitution matrices used differ finds an optimal alignment the exact algorithm (and computational complexity) depends on gap penalty function (we’ll come back to this issue)

Equally Optimal Alignments many optimal alignments may exist for a given pair of sequences can use preference ordering over paths when doing traceback highroadlowroad highroad and loadroad alignments show the two most different optimal alignments

Highroad & Lowroad Alignments A -2 A -4 CAG A -6 C x: y: A G A A A - C C lowroad alignment x: y: A - A G A A C C highroad alignment

Dynamic Programming Analysis there are possible global alignments for 2 sequences of length n e.g. two sequences of length 1000 have possible alignments but the DP approach finds an optimal alignment efficiently

Computational Complexity initialization: O(m), O(n) filling in rest of matrix: O(mn) traceback: O(m + n) hence, if sequences have nearly same length, the computational complexity is

Local Alignment so far we have discussed global alignment, where we are looking for best match between sequences from one end to the other. more commonly, we will want a local alignment, the best match between subsequences of x and y.

Local Alignment Motivation useful for comparing protein sequences that share a common domain but differ elsewhere useful for comparing against genomic sequences (long stretches of uncharacterized sequence) more sensitive when comparing highly diverged sequences

Local Alignment DP Algorithm original formulation: Smith & Waterman, Journal of Molecular Biology, 1981 interpretation of array values is somewhat different –F [ i, j ] = score of the best alignment of a suffix of x[1…i ] and a suffix of y[1…j ]

Local Alignment DP Algorithm the recurrence relation is slightly different than for global algorithm

Local Alignment DP Algorithm initialization: first row and first column initialized with 0’s traceback: –find maximum value of F(i, j); can be anywhere in matrix –stop when we get to a cell with value 0

Local Alignment Example T T A A G G 0 A 0 A 0 A x: y: G G A A A A

More On Gap Penalty Functions a gap of length k is more probable than k gaps of length 1 –a gap may be due to a single mutational event that inserted/deleted a stretch of characters –separated gaps are probably due to distinct mutational events a linear gap penalty function treats these cases the same it is more common to use gap penalty functions involving two terms –a penalty h associated with opening a gap –a smaller penalty g for extending the gap

Gap Penalty Functions linear affine

Dyanamic Programming for the Affine Gap Penalty Case to do in time, need 3 matrices instead of 1 best score given that y[j] is aligned to a gap best score given that x[i] is aligned to a gap best score given that x[i] is aligned to y[j]

Global Alignment DP for the Affine Gap Penalty Case

initialization traceback –start at largest of –stop at any of

Local Alignment DP for the Affine Gap Penalty Case

initialization traceback –start at largest –stop at

Computational Complexity and Gap Penalty Functions linear: affine: general:

Alignment (Global) with General Gap Penalty Function consider every previous element in the row consider every previous element in the column