Sequence Comparison I519 Introduction to Bioinformatics, Fall 2012.



Why we compare sequences
• Find sequence similarity between protein/DNA sequences
  – To infer functional similarity
  – To infer evolutionary history
• Find residues that are important for a protein’s function
  – Functional sites of a protein
  – DNA elements (e.g., transcription factor binding sites)

Comparison of sequences at various levels
• We may look at sequences at different levels
  – Whole genome comparison (will be covered later)
  – Whole DNA/protein sequence
  – Protein domains
  – Motifs (protein motifs & motifs at DNA level)
• For protein-coding genes, comparison at the amino acid level instead of the nucleotide level achieves higher sensitivity & specificity (20 letters versus 4 letters)

Protein domains
[Figure: proteins drawn as combinations of domains A, B, C, D, E]
• A: all-β regulatory domain
• B: α/β-substrate binding domain
• C: α/β-nucleotide binding domain
• Domain: a structurally/functionally/evolutionarily independent unit
• Domain combination: two domains appearing in a protein

PROSITE patterns
• Described by regular grammars
• Programs allow searching databases for PROSITE patterns (e.g., ScanProsite)
• We have seen the ATP-binding motif, [AG]-x(4)-G-K-[ST]; another example: [EDQH]-x-K-x-[DN]-G-x-R-[GACV]
• Rules:
  – Each position is separated by a hyphen
  – One character denotes the residue at a given position; x denotes any residue
  – […] denotes a set of allowed residues
  – (n) denotes a repeat of n
  – (n,m) denotes a repeat of between n and m, inclusive
  – Ex.: ATP/GTP binding motif [AG]-x(4)-G-K-[ST]
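The rules above map almost directly onto regular expressions. The following is a minimal sketch, assuming only the syntax listed on this slide plus x for any residue; the function name prosite_to_regex and the test sequence are illustrative, and other PROSITE syntax (such as {…} exclusions or < and > anchors) is deliberately not handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into a Python regular expression.

    Supports single residues, x (any residue), [...] sets, and the
    repeats (n) and (n,m). Anything else raises ValueError.
    """
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        # Split an element such as "x(4)" or "[ST](2,3)" into symbol and repeat.
        m = re.fullmatch(r"(\[[A-Z]+\]|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        symbol, low, high = m.groups()
        if symbol in ("x", "X"):
            symbol = "."  # x stands for any residue
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{%s}" % low
        else:
            repeat = "{%s,%s}" % (low, high)
        parts.append(symbol + repeat)
    return "".join(parts)


p_loop = prosite_to_regex("[AG]-x(4)-G-K-[ST]")
print(p_loop)  # [AG].{4}GK[ST]

# A short protein fragment used only as a test string.
hit = re.search(p_loop, "MTEYKLVVVGAGGVGKSALTIQ")
print(hit.group())  # GAGGVGKS
```

Compiling the pattern to a regular expression is one simple way to scan a sequence; dedicated tools such as ScanProsite support the full pattern language.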

Three principal methods of pairwise sequence alignment
• Dot matrix analysis
• The dynamic programming algorithm
• Word or k-tuple methods, such as those used by FASTA and BLAST

The dot matrix method

Pairwise alignment (of biological molecules)
Given 2 DNA sequences v and w:
v: ATCTGAT (m = 7)
w: TGCATA (n = 6)
[Figure: example alignment of V and W (shown as ATCTGATG and TGCATAC), with columns labeled match, mismatch, insertion, and deletion; insertions and deletions together are called indels]

A simple string comparison solution: Hamming distance
• The most often used distance on strings in computer science: the number of positions at which the corresponding symbols are different
• Hamming distance always compares the i-th letter of v with the i-th letter of w
V = ATATATAT
W = TATATATA
Hamming distance: d(v, w) = 8
Computing Hamming distance is a trivial task
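As a small illustration of how simple this computation is, here is a sketch of Hamming distance for two equal-length strings; the function name hamming_distance is an assumption made for this example.

```python
def hamming_distance(v: str, w: str) -> int:
    """Count the positions at which v and w differ (equal lengths required)."""
    if len(v) != len(w):
        raise ValueError("Hamming distance needs strings of equal length")
    # Compare the i-th letter of v with the i-th letter of w.
    return sum(a != b for a, b in zip(v, w))


print(hamming_distance("ATATATAT", "TATATATA"))  # 8
```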

Hamming distance is easy to compute, but…
Hamming distance makes some sense for comparing DNA sequences in some cases (e.g., when the sequences differ only by substitutions). But there are other kinds of mutations:
• Substitution: ACAGT → ACGGT
• Insertion/deletion (indel): ACAGT → ACGT
• Inversion: ACA……GT → AG……ACT
• Translocation: AC……AG…TAA → AG…TC……AAA
• Duplication
We only consider the first two kinds of mutations for now. There are algorithms for the other mutations…

Comparing two strings: Edit distance
Levenshtein (1966) introduced the edit distance between two strings as the minimum number of elementary operations (insertions, deletions, and substitutions) needed to transform one string into the other:
d(v, w) = minimum number of elementary operations to transform v → w

Edit distance & Hamming distance
V = ATATATAT
W = TATATATA
• Hamming distance: d(v, w) = 8. Computing Hamming distance is a trivial task.
• Edit distance: d(v, w) = 2. Computing edit distance is a non-trivial task.
Just one shift makes it all line up:
V = -ATATATAT
W = TATATATA-
Hamming distance always compares the i-th letter of v with the i-th letter of w. Edit distance may compare the i-th letter of v with the j-th letter of w. How do we find which j goes with which i?

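Edit Distance: Example
TGCATAT → ATCCGAT in 5 steps:
TGCATAT → (delete last T)
TGCATA → (delete last A)
TGCAT → (insert A at front)
ATGCAT → (substitute C for the G in 3rd position)
ATCCAT → (insert G before last A)
ATCCGAT (Done)
These 5 steps only show that the edit distance is at most 5; they are not necessarily the minimum (substituting the four mismatching positions directly takes just 4 steps). But how do we find the minimum? Dynamic programming.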
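To make the dynamic programming idea concrete, here is a minimal sketch of the standard table-filling computation of edit distance, with unit cost for insertions, deletions, and substitutions. The function name edit_distance and the variable names are assumptions for this example, not taken from the slides.

```python
def edit_distance(v: str, w: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning v into w."""
    m, n = len(v), len(w)
    # D[i][j] = edit distance between the first i letters of v and the first j of w.
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i  # delete all i letters
    for j in range(1, n + 1):
        D[0][j] = j  # insert all j letters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitution = 0 if v[i - 1] == w[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j - 1] + substitution,  # match or substitute
                D[i - 1][j] + 1,                 # delete v[i]
                D[i][j - 1] + 1,                 # insert w[j]
            )
    return D[m][n]


print(edit_distance("ATATATAT", "TATATATA"))  # 2
print(edit_distance("TGCATAT", "ATCCGAT"))    # 4
```

On the example above the sketch reports 4, so the 5-step transformation is a valid but not a shortest sequence of edits.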

Giving a population history
• If we watched every generation, we could annotate the tree with exactly where mutations have happened.
[Figure: tree of sequences with mutations marked x along the branches]
Generation 0: AGGATTA
Lineage 1: Generation 129: AGGATA; Gen. 245: AGATA; Gen. 280: CGATA; Current: CGATA
Lineage 2: Generation 172: AGGCCCATTA; Gen. 295: GGCCCATTA; Current: GGCCCATTA
This would give a history to the current sequences.

Edit distance vs. ancestral reconstruction
• Edit distance is simpler than ancestral reconstruction.
• The order of the edit operations does not matter.
• If two events overlap or even cancel each other during evolution, they cannot be seen in the edit distance.
• It is a distance metric:
  – Identity: d(x,y) = 0 iff x = y
  – Symmetry: d(x,y) = d(y,x)
  – Triangle inequality: d(x,z) <= d(x,y) + d(y,z)

Alignment
It is hard to show the edit operations visually (e.g., C → insert). An alignment is much nicer:
ATGCA-TTTA
||| | || |
ATGTACTT-A
match = 0, mismatch = -1, indel = -1. Score = the sum of the scores of all positions of the alignment. Then computing the edit distance is equivalent to finding the optimal (maximum scoring) alignment: the edit distance is the negative of the optimal score.

“Optimal” alignment
The term “optimal” alignment is somewhat misleading. Ideally, we want to find the “real” alignment of the sequences, the one that reflects their actual evolution. What we compute instead is the “optimal” alignment under a score function. The “optimal” solution is not necessarily the correct solution; it all depends on how good the score function is. The identity scoring scheme is not very accurate. For example, transitions and transversions get the same score. As we continue with this alignment topic, we will refine the score functions.

Scoring sequence alignment
• How to score an alignment?
• Simplest scoring scheme: match = 0, mismatch = -1, indel = -1
• This is called a “linear gap penalty” because the cost of a gap (a run of consecutive indels) is proportional to its length. (We could have each gap position cost g, for some negative constant g.)
• Let’s see some examples

Given an alignment, it is trivial to compute the alignment score
AATGCGA-TTTT
  ||  | |||
G-TG--ACTTTC
6 matches: 0; 2 mismatches: -2; 4 indels: -4; alignment score = -6 (6 edit operations)
AATG-CGATTTT
  || |  ||
G-TGAC-TTTC-
5 matches: 0; 3 mismatches: -3; 4 indels: -4; alignment score = -7 (7 edit operations)
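The column-by-column scoring used in these two examples can be sketched as follows, assuming the simple scheme from the previous slide (match = 0, mismatch = -1, indel = -1) and "-" as the gap symbol. The function name alignment_score is an assumption for this example.

```python
def alignment_score(top: str, bottom: str, match=0, mismatch=-1, indel=-1) -> int:
    """Sum the per-column scores of a given alignment (rows must have equal length)."""
    if len(top) != len(bottom):
        raise ValueError("both rows of an alignment must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += indel      # gap in one of the rows
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


print(alignment_score("AATGCGA-TTTT", "G-TG--ACTTTC"))  # -6
print(alignment_score("AATG-CGATTTT", "G-TGAC-TTTC-"))  # -7
```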

Alignment with DP
• The question is: how can an alignment be computed by a computer?
• Dynamic programming
  – Requires that every subsolution of an optimal solution is itself optimal (optimal substructure).

Alignment as a path in the edit graph
Every path in the edit graph corresponds to an alignment:
AT-GTTA-T
ATCG-TAC-

Recursive definition

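Dynamic programming algorithm
(s(a,b) is the score for aligning letters a and b; g is the score of one indel)
S[0,0] = 0
for i from 1 to M: S[i,0] = S[i-1,0] + g
for j from 1 to N: S[0,j] = S[0,j-1] + g
for i from 1 to M
  for j from 1 to N
    S[i,j] = max{S[i-1,j-1] + s(x[i],y[j]), S[i-1,j] + g, S[i,j-1] + g}
Output S[M,N]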
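The pseudocode above translates almost line for line into Python. This is a sketch under the simple scoring scheme used earlier (match = 0, mismatch = -1, g = -1); the function name global_alignment_score and the default parameter values are assumptions, and Python's 0-based strings mean x[i] in the pseudocode becomes x[i - 1] here.

```python
def global_alignment_score(x: str, y: str, match=0, mismatch=-1, g=-1) -> int:
    """Fill the DP matrix S and return the optimal global alignment score S[M][N]."""
    M, N = len(x), len(y)
    S = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        S[i][0] = S[i - 1][0] + g       # x[1..i] aligned against gaps only
    for j in range(1, N + 1):
        S[0][j] = S[0][j - 1] + g       # y[1..j] aligned against gaps only
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + s,    # align x[i] with y[j]
                S[i - 1][j] + g,        # x[i] against a gap
                S[i][j - 1] + g,        # y[j] against a gap
            )
    return S[M][N]


print(global_alignment_score("PELICAN", "CWELACANTH"))  # -5
```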

Fill up the dynamic programming matrix
A bottom-up calculation to get the optimal score (only!)
seq[1] = PELICAN
seq[2] = CWELACANTH
Scoring function: mismatch = -1, indel = -1
[Figure: DP matrix with columns # C W E L A C A N T H and rows # P E L I C A N]

Traceback to get the actual alignment
No need to physically record the green arrows. Instead, we trace back, following the red arrows!
[Figure: DP matrix with columns # C W E L A C A N T H and rows # P E L I C A N, with traceback arrows]
CWELACANTH
-PELICAN--

More formal backtracking
Idea: The matrix is filled from upper left to lower right; we then backtrack along the optimal path.
• Start in the lower right: let i = m, j = n
• Until i = 0 and j = 0:
  – Figure out which of the three terms gave rise to M[i,j] by picking the largest: M[i-1,j] + indel, M[i,j-1] + indel, M[i-1,j-1] + f(s[i],t[j])
  – Move to the corresponding cell (reduce i, reduce j, or reduce both), and write down the configuration of the current column.
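The backtracking steps above can be sketched as follows. The code fills the matrix first and then walks back from the lower right corner, checking which of the three terms produced each cell. The function name global_align, the default scores (match = 0, mismatch = -1, indel = -1), and the tie-breaking order (diagonal first) are assumptions for this example; when several optimal alignments exist, a different order would return a different one.

```python
def global_align(s: str, t: str, match=0, mismatch=-1, indel=-1):
    """Return (score, aligned_s, aligned_t) for one optimal global alignment."""
    m, n = len(s), len(t)

    def f(a, b):
        return match if a == b else mismatch

    # Fill the matrix from upper left to lower right.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = M[i - 1][0] + indel
    for j in range(1, n + 1):
        M[0][j] = M[0][j - 1] + indel
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(M[i - 1][j - 1] + f(s[i - 1], t[j - 1]),
                          M[i - 1][j] + indel,
                          M[i][j - 1] + indel)

    # Trace back from the lower right, writing columns in reverse order.
    i, j = m, n
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + f(s[i - 1], t[j - 1]):
            top.append(s[i - 1]); bottom.append(t[j - 1])
            i -= 1; j -= 1
        elif i > 0 and M[i][j] == M[i - 1][j] + indel:
            top.append(s[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1])
            j -= 1
    return M[m][n], "".join(reversed(top)), "".join(reversed(bottom))


score, row1, row2 = global_align("CWELACANTH", "PELICAN")
print(score)
print(row1)
print(row2)
```

No arrows are stored: the traceback recomputes which neighbor each cell came from, which is the point made on the traceback slide.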

How similar biology and informatics are
[Figure: INFORMATICS compared with BIOLOGY]

Space, time requirements
The algorithm runs in O(nm) time: each cell requires only 3 checks of other cells in the matrix. We also need O(nm) space to store the matrix. If we only want to know the score of the optimal alignment, we can compute it in O(min(m,n)) space by keeping only the previous row (or column). With a divide-and-conquer refinement (Hirschberg's algorithm), reconstructing the alignment also requires only O(m+n) space.
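The score-only, reduced-space variant can be sketched by keeping just the previous row of the matrix while computing the current one. This is an illustration under the simple scoring scheme (match = 0, mismatch = -1, g = -1); the function name is an assumption, and the sketch does not reconstruct the alignment itself.

```python
def alignment_score_linear_space(x: str, y: str, match=0, mismatch=-1, g=-1) -> int:
    """Optimal global alignment score using O(min(m, n)) extra space."""
    if len(y) > len(x):
        x, y = y, x  # scoring is symmetric, so put the shorter sequence along the row
    prev = [j * g for j in range(len(y) + 1)]   # row 0 of the DP matrix
    for i in range(1, len(x) + 1):
        curr = [prev[0] + g]                    # first column of row i
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            curr.append(max(prev[j - 1] + s,    # diagonal
                            prev[j] + g,        # from the row above
                            curr[j - 1] + g))   # from the left
        prev = curr                             # discard older rows
    return prev[-1]


print(alignment_score_linear_space("PELICAN", "CWELACANTH"))  # -5
```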

Alignments are scored
We need to score alignments. The alignment with the highest score may not be the one that actually matches evolutionary history, so you should never trust that an alignment must be right: it just optimizes the score. When we move to multiple alignments, things get worse: there is not even a guarantee of finding the optimal score.

A related problem: Manhattan Tourist Problem (MTP)
Imagine seeking a path (from source to sink) that travels only eastward and southward through the Manhattan grid and passes the largest number of attractions (*).
[Figure: Manhattan grid with source, sink, and attractions marked *]
Goal: Find a longest path in G from “source” to “sink”
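The Manhattan Tourist Problem has the same dynamic programming structure as alignment. The sketch below assumes a simplified model that is not spelled out on the slide: attractions are counted at intersections, grid[r][c] holds the number of attractions at row r and column c, the source is the top-left corner, and the sink is the bottom-right corner. The function name and the example grid are hypothetical.

```python
def manhattan_tourist(grid):
    """Most attractions collectable on a path moving only east or south."""
    rows, cols = len(grid), len(grid[0])
    # best[r][c] = most attractions on any path from the source to (r, c).
    best = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if r == 0 and c == 0:
                prev = 0                          # the source itself
            elif r == 0:
                prev = best[r][c - 1]             # can only arrive from the west
            elif c == 0:
                prev = best[r - 1][c]             # can only arrive from the north
            else:
                prev = max(best[r - 1][c], best[r][c - 1])
            best[r][c] = grid[r][c] + prev
    return best[-1][-1]


# Hypothetical 4x4 grid: 1 marks an attraction (*), 0 marks none.
example = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
print(manhattan_tourist(example))  # 5
```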

Longest Path in DAG Problem
Goal: Find a longest path between two vertices in a weighted DAG
Input: A weighted DAG G with source and sink vertices
Output: A longest path in G from source to sink
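One common way to solve this problem is to process the vertices in topological order, so that every predecessor of a vertex is finished before the vertex itself. The sketch below uses graphlib.TopologicalSorter from the Python standard library (Python 3.9 or later); the edge representation (a dict from (u, v) pairs to weights), the function name, and the example graph are assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def longest_path(edges, source, sink):
    """Return (length, path) of a longest source-to-sink path, or None if unreachable.

    edges maps (u, v) pairs to edge weights and must describe a DAG.
    """
    preds = {}
    for (u, v), w in edges.items():
        preds.setdefault(v, []).append((u, w))
        preds.setdefault(u, [])
    # TopologicalSorter takes a mapping from each node to its predecessors.
    order = TopologicalSorter(
        {v: [u for u, _ in ps] for v, ps in preds.items()}
    ).static_order()

    best = {source: 0}   # best[v] = length of a longest path from source to v
    back = {}            # back[v] = predecessor of v on that path
    for v in order:
        for u, w in preds[v]:
            if u in best and (v not in best or best[u] + w > best[v]):
                best[v] = best[u] + w
                back[v] = u
    if sink not in best:
        return None
    path = [sink]
    while path[-1] != source:
        path.append(back[path[-1]])
    return best[sink], path[::-1]


# Hypothetical weighted DAG.
dag = {("s", "a"): 2, ("s", "b"): 1, ("b", "a"): 3, ("a", "t"): 1, ("b", "t"): 2}
print(longest_path(dag, "s", "t"))  # (5, ['s', 'b', 'a', 't'])
```

Global alignment and the Manhattan grid are special cases in which the topological order is simply row by row.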

“Edit distance problem” Runtime
• It takes O(nm) time to fill in the dynamic programming matrix.
• Why O(nm)? The pseudocode consists of a “for” loop nested inside another “for” loop to fill in the dynamic programming matrix.

• Reading:
  – Chapter 4 (Producing and Analyzing Sequence Alignments)
• Next time we will talk about global and local pairwise sequence alignment, focusing on
  – How alignment of biological sequences is different from comparison of two strings (scoring matrix + indel penalties)
  – Global versus local