Pairwise Sequence Alignment

Slides:



Advertisements
Similar presentations
Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
Advertisements

Alignment methods Introduction to global and local sequence alignment methods Global : Needleman-Wunch Local : Smith-Waterman Database Search BLAST FASTA.
Bioinformatics Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington
Sequence comparison: Introduction and motivation Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
Lecture 8 Alignment of pairs of sequence Local and global alignment
Definitions Optimal alignment - one that exhibits the most correspondences. It is the alignment with the highest score. May or may not be biologically.
Sequence Similarity Searching Class 4 March 2010.
Heuristic alignment algorithms and cost matrices
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez June 23, 2005.
Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez June 23, 2004.
Multiple Sequence Alignment
Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.
Pairwise Alignment Global & local alignment Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 20, 2003.
Computational Biology, Part 2 Sequence Comparison with Dot Matrices Robert F. Murphy Copyright  1996, All rights reserved.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Bioinformatics Workshop, Fall 2003 Algorithms in Bioinformatics Lawrence D’Antonio Ramapo College of New Jersey.
Sequence similarity. Motivation Same gene, or similar gene Suffix of A similar to prefix of B? Suffix of A similar to prefix of B..Z? Longest similar.
Dynamic Programming. Pairwise Alignment Needleman - Wunsch Global Alignment Smith - Waterman Local Alignment.
Protein Sequence Comparison Patrice Koehl
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 10, 2005.
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
Sequence comparison: Score matrices Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
Sequence comparison: Local alignment
Sequence comparison: Local alignment Genome 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble.
Developing Pairwise Sequence Alignment Algorithms
Presented by Liu Qi Pairwise Sequence Alignment. Presented By Liu Qi Why align sequences? Functional predictions based on identifying homologues. Assumes:
Sequence Analysis Determining how similar 2 (or more) gene/protein sequences are (too each other) is a “staple” function in bioinformatics. This information.
Bioiformatics I Fall Dynamic programming algorithm: pairwise comparisons.
Pair-wise Sequence Alignment What happened to the sequences of similar genes? random mutation deletion, insertion Seq. 1: 515 EVIRMQDNNPFSFQSDVYSYG EVI.
Traceback and local alignment Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington.
BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment.
Pairwise alignments Introduction Introduction Why do alignments? Why do alignments? Definitions Definitions Scoring alignments Scoring alignments Alignment.
Pairwise & Multiple sequence alignments
Computational Biology, Part 3 Sequence Alignment Robert F. Murphy Copyright  1996, All rights reserved.
Content of the previous class Introduction The evolutionary basis of sequence alignment The Modular Nature of proteins.
Alignment methods April 26, 2011 Return Quiz 1 today Return homework #4 today. Next homework due Tues, May 3 Learning objectives- Understand the Smith-Waterman.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Pairwise Sequence Alignment BMI/CS 776 Mark Craven January 2002.
Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.
Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.
Sequence comparison: Dynamic programming Genome 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble.
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
Chapter 3 Computational Molecular Biology Michael Smith
Applied Bioinformatics Week 3. Theory I Similarity Dot plot.
Sequence Alignments with Indels Evolution produces insertions and deletions (indels) – In addition to substitutions Good example: MHHNALQRRTVWVNAY MHHALQRRTVWVNAY-
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Sequence Alignment.
Step 3: Tools Database Searching
Techniques for Protein Sequence Alignment and Database Searching G P S Raghava Scientist & Head Bioinformatics Centre, Institute of Microbial Technology,
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
Pairwise sequence comparison
Sequence comparison: Dynamic programming
Sequence comparison: Local alignment
Sequence comparison: Significance of similarity scores
Sequence comparison: Traceback and local alignment
GENOME 559: Introduction to statistical and computational genomics
Sequence Alignment 11/24/2018.
Sequence comparison: Dynamic programming
Pairwise sequence Alignment.
Pairwise Sequence Alignment
Sequence comparison: Local alignment
Sequence comparison: Traceback
BCB 444/544 Lecture 7 #7_Sept5 Global vs Local Alignment
Sequence Alignment Algorithms Morten Nielsen BioSys, DTU
Sequence comparison: Significance of similarity scores
Sequence comparison: Introduction and motivation
Presentation transcript:

Pairwise Sequence Alignment H.C.Huang @ 2005/9/22 2005 Autumn / YM / Bioinformatics

Outline Motivation A simple alignment problem Scoring alignments Substition matrices Gap penalties Dynamic programming Global vs. local alignment

Reference D.W. Mount / Bioinformatics / Ch.3 Slides from Prof. C.H.Chang UW / Genomic Informatics / W.S. Noble

Motivation Why align two protein or DNA sequences? Determine whether they are descended from a common ancestor (homologous). Infer a common function. Locate functional elements (motifs or domains). Infer protein structure, if the structure of one of the sequences is known.

Example

Brief History Robinson (1938) described the 1st algorithm for Longest Common Subsequence problem Levenshtein (1966) introduced “edit distance” between strings Min. # of edit operations needed to transform from one string to another No algorithm described Gibbs & McIntyre (1970): dot matrix analysis Needleman & Wunsch (1970) “re-discovered” the algorithm for sequence alignment using DP Smith & Waterman (1981) proposed a clever modification of DP for local alignment

Brief History Knuth (1973) suggested a filtering idea for pattern matching 高德纳 (Donald Knuth) Dumas & Ninio (1982) first applied the filtering idea for fast sequence comparison Lipman & Pearson (1985): FASTA Altschul et al. (1990): BLAST

Pairwise Alignment Methods Dotplot analysis method Dynamic programming (DP) algorithm Word / k-tuple methods

How dotplot analysis works

How dotplot analysis works Direct Repeat Reversed Repeat Aligned sequence

To reduce random noise in dotplots Specify a window size, w Examine w residues at a time Count how many matches are among the w pairs of residues

Dotlet http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html

A simple alignment problem. Problem: find the best pairwise alignment of GAATC and CATAC.

Scoring alignments GAATC CATAC GAAT-C C-ATAC -GAAT-C C-A-TAC GAATC- We need a way to measure the quality of a candidate alignment. Alignment scores consist of two parts: a substition matrix, and a gap penalty.

Scoring aligned bases Purine A G Pyrimidine C T Transversion (low score) Transition (high score)

Scoring aligned bases GAATC CATAC Purine A G Pyrimidine C T A C G T 10 Transversion Transition A hypothetical substitution matrix: GAATC CATAC A C G T 10 -5 -5 + 10 + -5 + -5 + 10 = 5

Scoring aligned bases GAAT-C CA-TAC Purine A G Pyrimidine C T A C G T Transversion (expensive) Transition (cheap) A hypothetical substitution matrix: GAAT-C CA-TAC A C G T 10 -5 -5 + 10 + ? + 10 + ? + 10 = ?

Scoring gaps Linear gap penalty: every gap receives a score of d. Affine gap penalty: opening a gap receives a score of d; extending a gap receives a score of e. GAAT-C d=-4 CA-TAC -5 + 10 + -4 + 10 + -4 + 10 = 17 G--AATC d=-4 CATA--C e=-1 -5 + -4 + -1 + 10 + -4 + -1 + 10 = 5

A simple alignment problem. Problem: find the best pairwise alignment of GAATC and CATAC. Use a linear gap penalty of -4. Use the following substitution matrix: A C G T 10 -5

How many possibilities? GAATC CATAC GAAT-C C-ATAC -GAAT-C C-A-TAC GAATC- CA-TAC GAAT-C CA-TAC GA-ATC CATA-C How many different alignments of two sequences of length N exist?

How many possibilities? GAATC CATAC GAAT-C C-ATAC -GAAT-C C-A-TAC GAATC- CA-TAC GAAT-C CA-TAC GA-ATC CATA-C How many different alignments of two sequences of length N exist? Too many to enumerate!

-G- CAT DP matrix G A T C The value in position (i,j) is the score of the best alignment of the first i positions of the first sequence versus the first j positions of the second sequence. -8

DP matrix -8 G A T C -12 -G-A CAT- Moving horizontally in the matrix introduces a gap in the sequence along the left edge.

DP matrix -8 G A T C -12 -G-- CATA Moving vertically in the matrix introduces a gap in the sequence along the top edge.

Initialization G A T C

G - Introducing a gap G A T C -4

- C DP matrix G A T C -4

DP matrix G A T C -4 -8

G C DP matrix G A T C -4 -5

----- CATAC DP matrix G A T C -4 -8 -12 -16 -20 -5

DP matrix G A T C -4 -8 -12 -16 -20 -5 ?

DP matrix G A T C -4 -8 -12 -16 -20 -5 -G CA G- CA --G CA- -4 -9 -12 -4 -8 -12 -16 -20 -5 -4 -4

DP matrix G A T C -4 -8 -12 -16 -20 -5 ?

DP matrix G A T C -4 -8 -12 -16 -20 -5

DP matrix G A T C -4 -8 -12 -16 -20 -5 ?

What is the alignment associated with this entry? DP matrix G A T C -4 -8 -12 -16 -20 -5 -9 5 1 2 -2 What is the alignment associated with this entry?

DP matrix G A T C -4 -8 -12 -16 -20 -5 -9 5 1 2 -2 -G-A CATA

Find the optimal alignment, and its score. DP matrix G A T C -4 -8 -12 -16 -20 -5 -9 5 1 2 -2 ? Find the optimal alignment, and its score.

DP matrix G A T C -4 -8 -12 -16 -20 -5 -9 -13 -6 5 1 -3 -7 11 7 2 6 -2 -4 -8 -12 -16 -20 -5 -9 -13 -6 5 1 -3 -7 11 7 2 6 -2 17

DP matrix GA-ATC CATA-C G A T C -4 -8 -12 -16 -20 -5 -9 -13 -6 5 1 -3 -4 -8 -12 -16 -20 -5 -9 -13 -6 5 1 -3 -7 11 7 2 6 -2 17

DP matrix GAAT-C CA-TAC G A T C -4 -8 -12 -16 -20 -5 -9 -13 -6 5 1 -3 -4 -8 -12 -16 -20 -5 -9 -13 -6 5 1 -3 -7 11 7 2 6 -2 17

DP matrix GAAT-C C-ATAC G A T C -4 -8 -12 -16 -20 -5 -9 -13 -6 5 1 -3 -4 -8 -12 -16 -20 -5 -9 -13 -6 5 1 -3 -7 11 7 2 6 -2 17

DP matrix GAAT-C -CATAC G A T C -4 -8 -12 -16 -20 -5 -9 -13 -6 5 1 -3 -4 -8 -12 -16 -20 -5 -9 -13 -6 5 1 -3 -7 11 7 2 6 -2 17

Multiple solutions GA-ATC CATA-C When a program returns a sequence alignment, it may not be the only best alignment. GAAT-C CA-TAC GAAT-C C-ATAC GAAT-C -CATAC

DP in equation form Align sequence x and y. F is the DP matrix; s is the substitution matrix; d is the linear gap penalty.

DP in equation form

Dynamic programming Yes, it’s a weird name. DP is closely related to recursion and to mathematical induction. We can prove that the resulting score is optimal.

Dynamic Programming A method for solving recursive problem Break a problem into smaller subproblems Solve subproblems optimally, recursively Use these optimal solutions to construct an optimal solution for the original problem

Summary Scoring a pairwise alignment requires a substition matrix and gap penalties. Dynamic programming is an efficient algorithm for finding the optimal alignment. Entry (i,j) in the DP matrix stores the score of the best-scoring alignment up to those positions. DP iteratively fills in the matrix using a simple mathematical rule.

A simple exercise A G C Find the optimal alignment of AAG and AGC. Use a gap penalty of d=-5. A C G T 2 -7 -5 A G C

A simple exercise A G C Find the optimal alignment of AAG and AGC. Use a gap penalty of d=-5. A C G T 2 -7 -5 A G C

A simple exercise A G -5 -10 -15 C Find the optimal alignment of AAG and AGC. Use a gap penalty of d=-5. A C G T 2 -7 -5 A G -5 -10 -15 C

A simple exercise A G -5 -10 -15 2 -3 -8 -1 C -6 Find the optimal alignment of AAG and AGC. Use a gap penalty of d=-5. A C G T 2 -7 -5 A G -5 -10 -15 2 -3 -8 -1 C -6

Traceback Start from the lower right corner and trace back to the upper left. Each arrow introduces one character at the end of each aligned sequence. A horizontal move puts a gap in the left sequence. A vertical move puts a gap in the top sequence. A diagonal move uses one character from each sequence.

A simple example A G -5 2 -3 -1 C -6 Find the optimal alignment of AAG and AGC. Use a gap penalty of d=-5. Start from the lower right corner and trace back to the upper left. Each arrow introduces one character at the end of each aligned sequence. A horizontal move puts a gap in the left sequence. A vertical move puts a gap in the top sequence. A diagonal move uses one character from each sequence. A G -5 2 -3 -1 C -6

A simple example A G -5 2 -3 -1 C -6 Find the optimal alignment of AAG and AGC. Use a gap penalty of d=-5. Start from the lower right corner and trace back to the upper left. Each arrow introduces one character at the end of each aligned sequence. A horizontal move puts a gap in the left sequence. A vertical move puts a gap in the top sequence. A diagonal move uses one character from each sequence. A G -5 2 -3 -1 C -6 AAG- AAG- -AGC A-GC

Local alignment A single-domain protein may be homologous to a region within a multi-domain protein. Usually, an alignment that spans the complete length of both sequences is not required.

BLAST allows local alignments Global alignment Local alignment

Global alignment DP Align sequence x and y. F is the DP matrix; s is the substitution matrix; d is the linear gap penalty.

Local alignment DP Align sequence x and y. F is the DP matrix; s is the substitution matrix; d is the linear gap penalty.

Local DP in equation form

Local alignment Two differences with respect to global alignment: No score is negative. Traceback begins at the highest score in the matrix and continues until you reach 0. Global alignment algorithm: Needleman-Wunsch. Local alignment algorithm: Smith-Waterman.

A simple example Find the optimal local alignment of AAG and AGC. Use a gap penalty of d=-5. A C G T 2 -7 -5 A G C

A simple example Find the optimal local alignment of AAG and AGC. Use a gap penalty of d=-5. A C G T 2 -7 -5 A G C

A simple example Find the optimal local alignment of AAG and AGC. Use a gap penalty of d=-5. A C G T 2 -7 -5 A G 2 4 C

A simple example Find the optimal local alignment of AAG and AGC. Use a gap penalty of d=-5. A C G T 2 -7 -5 A G 2 4 C AG

Local alignment Find the optimal local alignment of AAG and GAAGGC. Use a gap penalty of d=-5. A C G T 2 -7 -5 A G C

Local alignment Find the optimal local alignment of AAG and GAAGGC. Use a gap penalty of d=-5. A C G T 2 -7 -5 A G 2 4 6 1 C

Substitution matrices Find the optimal local alignment of AAG and GAAGGC. Use a gap penalty of d=-5. A C G T 2 -7 -5 A G 2 4 6 1 C Where did this substitution matrix come from?

Nucleic acid PAM matrices PAM = point accepted mutation 1 PAM = 1% probability of mutation at each sequence position. A uniform PAM1 matrix: A G T C 0.99 0.00333

Transitions and transversions Transitions (A  G or C  T) are more likely than transversions (A  T or G  C) Assume that transitions are three times as likely: A G T C 0.99 0.006 0.002

Distant relatives If the probability of a substitution is 2%, simply multiply the probabilities from 1% by themselves. A PAM N matrix is computed by raising PAM 1 to the Nth power. A G T C 0.98014 0.011888 0.003984

Better gap scoring Real gaps are often more than one letter long. >gi|729942|sp|P40601|LIP1_PHOLU Lipase 1 precursor (Triacylglycerol lipase) Length = 645 Score = 33.5 bits (75), Expect = 5.9 Identities = 32/180 (17%), Positives = 70/180 (38%), Gaps = 9/180 (5%) Query: 2038 IYSLYGLYNVPYENLFVEAIASYSDNKIRSKSRRVIATTLETVGYQTANGKYKSESYTGQ 2097 +++ YGL+ Y+ ++ Y D K +R ++ + N + G+ Sbjct: 441 VFTAYGLWRY-YDKGWISGDLHYLDMKYEDITRGIVLNDW----LRKENASTSGHQWGGR 495 Query: 2098 LMAGYTYMMPENINLTPLAGLRYSTIKDKGYKETGTTYQNLTVKGKNYNTFDGLLGAKVS 2157 + AG+ + + +P+ + KGY+E+G + + Y++ G LG ++ Sbjct: 496 ITAGWDIPLTSAVTTSPIIQYAWDKSYVKGYRESGNNSTAMHFGEQRYDSQVGTLGWRLD 555 Query: 2158 SNINVNEIVLTPELYAMVDYAFKNKVSAIDARLQGMTAPLPTNSFKQSKTSFDVGVGVTA 2217 +N P ++ F +K I + + + S KQ + +G+ A Sbjct: 556 TNFG----YFNPYAEVRFNHQFGDKRYQIRSAINSTQTSFVSESQKQDTHWREYTIGMNA 611 Real gaps are often more than one letter long.

Affine gap penalty LETVGY W----L -5 -1 -1 -1 -5 -1 -1 -1 Separate penalties for gap opening and gap extension. This requires modifying the DP algorithm to store three values in each box.

Summary Local alignment finds the best match between subsequences. Smith-Waterman local alignment algorithm: No score is negative. Trace back from the largest score in the matrix. Substitution matrices represent the probability of mutations. Affine gap penalties include a large gap opening penalty and small gap extension penalty.