Sequence Comparison Algorithms Ellen Walker Bioinformatics Hiram College.


Sequence Comparison Algorithms Ellen Walker Bioinformatics Hiram College

The Problem We have two sequences that we want to compare, based on edit distance. Edit distance = number of single-character edits needed to get from one string to the other –Insertions –Deletions –Substitutions (changes)

Example LOVE => MONEY 1. Replace L by M 2. Replace V by N 3. Add Y at the end
L O V E –
M O N E Y

Brute Force Solution Try all possible alignments between the strings Looking at one string, –Every possible shift (space before or after) –Every possible gap (space within) –Gaps of various lengths, bounded by the size of the longest string

How many possibilities are there? Consider only single insertions: _ M _ O _ N _ E _ Y _ –There are N+1 places to insert, where N is the length of the string –At each place you have 2 choices (insert or not) –Therefore, just this subset already has 2^(N+1) possibilities –So, brute force is exponential!
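The count on this slide can be checked by enumeration. The sketch below is only an illustration: the function name `insertion_patterns` is made up here, and it models exactly the restricted case on the slide (at most one gap in each of the N+1 slots).

```python
from itertools import product


def insertion_patterns(s):
    """Yield every string obtained by putting at most one gap in each slot.

    A string of length N has N + 1 slots: before each character and
    after the last one.
    """
    slots = len(s) + 1
    for choice in product([False, True], repeat=slots):
        out = []
        for i, gap in enumerate(choice):
            if gap:
                out.append("_")
            if i < len(s):
                out.append(s[i])
        yield "".join(out)


patterns = list(insertion_patterns("MONEY"))
print(len(patterns))  # 64, i.e. 2 ** (5 + 1)
```

Each extra character doubles the count, which is why the full brute-force search (longer gaps, gaps in both strings) is out of reach.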

Dynamic Programming Score possibilities in an alignment matrix Value of any square in the matrix depends on: –Value above (if “vertical gap”) –Value beside (if “horizontal gap”) –Value diagonally above (if match or mismatch)

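Global Alignment Matrix (values follow match = +1, mismatch = -1, gap = -1)
        M   O   N   E   Y
    0  -1  -2  -3  -4  -5
L  -1  -1  -2  -3  -4  -5
O  -2  -2   0  -1  -2  -3
V  -3  -3  -1  -1  -2  -3
E  -4  -4  -2  -2   0  -1
Traceback from the lower right corner: gap opposite Y, E/E match, V/N mismatch, O/O match, L/M mismatch.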
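The matrix on this slide can be reproduced with a few lines of code. This is a minimal sketch, assuming the scoring match = +1, mismatch = -1, gap = -1 (inferred from the surviving matrix values, not stated on the slide); the function name is hypothetical.

```python
def global_alignment_matrix(a, b, match=1, mismatch=-1, gap=-1):
    """Fill a global (Needleman-Wunsch style) score matrix for a vs. b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Borders: aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # vertical gap
            left = score[i][j - 1] + gap    # horizontal gap
            score[i][j] = max(diag, up, left)
    return score


matrix = global_alignment_matrix("LOVE", "MONEY")
for row in matrix:
    print(row)
```

The lower right cell holds the score of the best global alignment, here -1.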

Local Alignment Matrix (same scoring, negative values replaced by 0)
       M  O  N  E  Y
    0  0  0  0  0  0
L   0  0  0  0  0  0
O   0  0  1  0  0  0
V   0  0  0  0  0  0
E   0  0  0  0  1  0
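The local matrix differs from the global one in two ways: the borders are 0, and no cell is allowed to drop below 0. A minimal sketch, again assuming match = +1, mismatch = -1, gap = -1 and a made-up function name:

```python
def local_alignment_matrix(a, b, match=1, mismatch=-1, gap=-1):
    """Fill a local (Smith-Waterman style) score matrix for a vs. b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]  # borders stay at 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            # The extra 0 lets an alignment restart anywhere.
            score[i][j] = max(0, diag, up, left)
    return score


local = local_alignment_matrix("LOVE", "MONEY")
best = max(max(row) for row in local)
```

For LOVE vs. MONEY the only nonzero cells are the O/O and E/E matches, so the best local score is 1.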

Computing the Alignment Matrix For each square: –Take the best of the vertical gap, horizontal gap, and (mis)match values (minimum for edit distance, maximum for similarity scores): O(1) There are N*M squares, where N and M are the lengths of the strings Therefore, time and space are both O(N*M) or (for short) O(N^2)
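When the matrix holds edit distances rather than similarity scores, each square takes the minimum of its three candidates. A minimal sketch with unit costs for insertion, deletion, and substitution (the function name is an illustrative choice):

```python
def edit_distance(a, b):
    """Unit-cost edit distance between a and b via an (N+1) x (M+1) matrix."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete i characters
    for j in range(cols):
        dist[0][j] = j  # insert j characters
    for i in range(1, rows):
        for j in range(1, cols):
            substitute = dist[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else 1)
            delete = dist[i - 1][j] + 1
            insert = dist[i][j - 1] + 1
            dist[i][j] = min(substitute, delete, insert)  # O(1) per square
    return dist[-1][-1]


print(edit_distance("LOVE", "MONEY"))  # 3
```

The result 3 agrees with the LOVE => MONEY example: two replacements and one insertion.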

But, what is N? If we’re matching genomes, N is huge! N^2 is too much time and space! How can we save further?

Ordering the Computations Each cell can be computed when the ones above, diagonally above, and to the left are computed –Left-to-right, top to bottom (row major) –Top-to-bottom, left to right (column major) –Across a diagonal wavefront

Saving Space: Row Major A row major computation really only needs two rows (the one above, and the current row). After each row is computed, the current row becomes the row above. Savings: space is O(N) instead of O(N^2) Cost: Insufficient information for traceback –Do a new alignment, limited to a region around the result.
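The two-row idea can be sketched as follows. This version returns only the final global score, which is exactly the limitation the slide mentions: without the full matrix there is nothing to trace back through. The scoring (match = +1, mismatch = -1, gap = -1) and the function name are assumptions for illustration.

```python
def global_score_two_rows(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score using O(len(b)) space instead of a full matrix."""
    above = [j * gap for j in range(len(b) + 1)]  # row 0 of the matrix
    for i in range(1, len(a) + 1):
        current = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = above[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current[j] = max(diag, above[j] + gap, current[j - 1] + gap)
        above = current  # the current row becomes the row above
    return above[-1]


print(global_score_two_rows("LOVE", "MONEY"))  # -1
```

Putting the shorter string in the `b` position keeps the two rows as small as possible.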

Saving Time: Wavefront Use a parallel processor (effectively N machines at a time) Each reverse diagonal is computed at once Time is now O(N), but cost is N processors instead of 1 A computer science theoretician would say “no savings”, but if you’re the one waiting, you might disagree!
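The wavefront order can be illustrated sequentially. In the sketch below, all cells in one reverse diagonal depend only on earlier diagonals, so a parallel machine could compute them simultaneously; this code simply visits them one after another. Function names and the scoring (match = +1, mismatch = -1, gap = -1) are assumptions.

```python
def antidiagonals(n, m):
    """Group cells (i, j), 1 <= i <= n, 1 <= j <= m, by wavefront i + j."""
    for d in range(2, n + m + 1):
        yield [(i, d - i) for i in range(max(1, d - m), min(n, d - 1) + 1)]


def global_score_by_wavefront(a, b, match=1, mismatch=-1, gap=-1):
    """Fill the global matrix one reverse diagonal at a time."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for front in antidiagonals(n, m):
        # Cells within one front are independent of each other.
        for i, j in front:
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[n][m]


fronts = list(antidiagonals(4, 5))
print(len(fronts))  # 8 fronts for a 4 x 5 matrix
```

With enough processors the running time is proportional to the number of fronts, N + M - 1, rather than to the number of cells, N*M.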

Saving More Time: Partial Search In local alignment, large areas have 0’s. Mismatches adjacent to 0’s are also 0’s. To get “reasonably large” values, you need longer sequences (BLAST “words”) in common So, only search near where there are common subsequences

Finding Common Subsequences Pick a sequence length. For each subsequence of that length, find all occurrences in each sequence If i is the index in one sequence and j is the index in the other sequence, then fill in the region of the alignment matrix near (i, j) (i,j) is called the seed
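A minimal sketch of this seeding step, assuming exact matches of a fixed length k. The function name `find_seeds` and the example strings are made up for illustration.

```python
from collections import defaultdict


def find_seeds(s, t, k):
    """Return sorted (i, j) pairs where s[i:i+k] == t[j:j+k]."""
    # Index every length-k substring of s by its start positions.
    index = defaultdict(list)
    for i in range(len(s) - k + 1):
        index[s[i:i + k]].append(i)
    # Look up every length-k substring of t in that index.
    seeds = []
    for j in range(len(t) - k + 1):
        for i in index.get(t[j:j + k], []):
            seeds.append((i, j))
    return sorted(seeds)


print(find_seeds("GATTACA", "TTACGAT", 3))  # [(0, 4), (2, 0), (3, 1)]
```

Only the regions of the alignment matrix near these (i, j) pairs would then be filled in.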

BLAST’s Generalization Consider a threshold T and a sequence S. The neighborhood of S is the set of all sequences that score T or better against S. BLAST uses neighborhoods to set seeds (areas of the alignment matrix to explore).
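The neighborhood definition can be sketched by brute-force enumeration. This is not BLAST's implementation, and it does not use BLOSUM62: to stay self-contained it assumes a DNA alphabet and a simple score of +1 per matching position and -1 per mismatching position. The function name is hypothetical.

```python
from itertools import product


def neighborhood(word, threshold, alphabet="ACGT", match=1, mismatch=-1):
    """All same-length words scoring at least `threshold` against `word`."""
    result = []
    for candidate in product(alphabet, repeat=len(word)):
        score = sum(match if c == w else mismatch for c, w in zip(candidate, word))
        if score >= threshold:
            result.append("".join(candidate))
    return result


print(neighborhood("ACG", 3))       # ['ACG']
print(len(neighborhood("ACG", 1)))  # 10: the word itself plus 9 one-mismatch words
```

Lowering the threshold enlarges the neighborhood, so more seeds are found and more of the matrix is explored.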

Consequences of Choices Higher T’s are faster, but ignore more potential matches Longer sequences are less common –Smaller neighborhoods for a given T –Fewer areas to search –More likelihood of missing good alignments

T vs Sequence Size Longer sequences have higher maximum scores (unless normalized) But, longer sequences (tend to) have more likelihood of mismatches?

Too Many Seeds If we pick a sequence length and threshold that is sufficiently sensitive, we still might have too many seeds for reasonable alignment times. Two-seed solution: –Only consider areas of the table that contain two seeds (diagonals) separated by a limited distance
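The two-seed filter can be sketched as a grouping step: seeds are bucketed by diagonal (i - j), and only neighboring seeds on the same diagonal that lie within a maximum distance are kept. This is a simplification for illustration (for example, it does not check whether the two seeds overlap); the function name and parameters are assumptions.

```python
from collections import defaultdict


def two_hit_pairs(seeds, max_distance):
    """Return pairs of consecutive seeds on one diagonal, at most max_distance apart."""
    by_diagonal = defaultdict(list)
    for i, j in seeds:
        by_diagonal[i - j].append((i, j))
    pairs = []
    for diagonal in sorted(by_diagonal):
        hits = sorted(by_diagonal[diagonal])
        for first, second in zip(hits, hits[1:]):
            if 0 < second[0] - first[0] <= max_distance:
                pairs.append((first, second))
    return pairs


seeds = [(0, 0), (5, 5), (40, 40), (3, 10), (2, 7)]
print(two_hit_pairs(seeds, 10))  # [((0, 0), (5, 5))]
```

Isolated seeds such as (40, 40), (3, 10), and (2, 7) are discarded, which cuts down the number of regions to extend.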

Extending Alignments A seed region is a small alignment We want to “grow” the alignments (especially if we can connect to others!) To grow an alignment, use Smith-Waterman to compute neighboring values Question: when to stop growing?

Score Changes During Growing As an alignment is extended, its score changes –Score increases when sub-matches connect –Score decreases when extended into unrelated area Often score must decrease before increasing!

When to Stop? Consider current score, compared to maximum score so far When the current score gets sufficiently small relative to the maximum, then stop This is another parameter with a tradeoff (stop too soon and get smaller results, stop too late and do useless work)
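The stopping rule can be sketched with a simplified, ungapped extension along one diagonal (the slides grow alignments with Smith-Waterman; the version here ignores gaps to keep the rule visible). The parameter `drop`, the function name, and the scoring (+1 match, -1 mismatch) are assumptions.

```python
def extend_right(s, t, i, j, drop, match=1, mismatch=-1):
    """Extend from (i, j) down the diagonal until the score falls `drop` below its maximum.

    Returns (best_score, length_of_best_extension).
    """
    score = best = 0
    best_length = 0
    k = 0
    while i + k < len(s) and j + k < len(t):
        score += match if s[i + k] == t[j + k] else mismatch
        k += 1
        if score > best:
            best, best_length = score, k
        elif best - score >= drop:
            break  # too far below the maximum so far: stop growing
    return best, best_length


print(extend_right("AAAATAAAA", "AAAACAAAA", 0, 0, drop=2))  # (7, 9)
print(extend_right("AAAATAAAA", "AAAACAAAA", 0, 0, drop=1))  # (4, 4)
```

With `drop=1` the extension gives up at the single mismatch and misses the second run of matches; with `drop=2` it tolerates the dip and connects the two runs. A larger `drop` does more work in unrelated regions before stopping.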

One more “trick” Suppose that there is a “standard” sequence that many people want to align against Run the seeding algorithm with different sequence lengths and thresholds and save the resulting seed locations When someone does a search, the seeding part has already been done

Offline vs. Online Algorithms Offline Algorithms –Execute “standardized” part of algorithm in advance, and save result –This is like compilation of a program Online Algorithm –Use the tables or databases you built offline to answer a specific query –This is like running a program –User sees only time taken by Online Algorithm
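The offline/online split can be sketched for seeding against a fixed reference sequence. The offline function builds a substring index once; the online function answers each query from that index without rescanning the reference. Function names and example strings are illustrative assumptions.

```python
from collections import defaultdict


def build_index(reference, k):
    """Offline: map every length-k substring of the reference to its positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return dict(index)


def query_seeds(index, query, k):
    """Online: look up each length-k substring of the query in the saved index."""
    return [
        (i, j)
        for j in range(len(query) - k + 1)
        for i in index.get(query[j:j + k], [])
    ]


index = build_index("GATTACAGATTACA", 4)   # done once, in advance
print(query_seeds(index, "TTACAG", 4))     # done per search
```

The online cost depends on the query length and the number of hits, not on the length of the reference.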

Common Offline/Online Applications Web searching –Offline: build indexes of sites vs. keywords –Online: retrieve sites from the index Neural networks –Offline: train the network on many examples of the problem, set the weights –Online: run the network once (with fixed weights) on the specific example

Summary Smith-Waterman is exact, accurate, and time-consuming (even though it uses dynamic programming to get down to O(N^2)) BLAST speeds up the search process, but is no longer exact, so it can miss good alignments (even the best one!)

Using BLAST Well Importance of setting parameters –Sequence length –Score threshold –Distance (for two-hit method) –Stopping condition (for growing seeded alignments)

Exercises Given the BLOSUM62 matrix at T/BLOSUM62.txt –What is the neighborhood of HID with threshold 5? 10? 15? Create two random sequences of 20 bases each (flip two coins for each base: HH=A, TT=T, HT=C, TH=G)