Sequence Comparison Algorithms Ellen Walker Bioinformatics Hiram College.

The Problem
We have two sequences that we want to compare, based on edit distance.
Edit distance = the minimum number of changes needed to get from one string to the other:
– Insertions
– Deletions
– Changes (replacements)

Example: LOVE => MONEY
1. Replace L by M
2. Replace V by N
3. Add Y at the end
L O V E –
M O N E Y

Brute Force Solution
Try all possible alignments between the strings.
Looking at one string, consider:
– Every possible shift (space before or after)
– Every possible gap (space within)
– Gaps of various lengths, bounded by the size of the longest string

How many possibilities are there?
Consider only single insertions: _ M _ O _ N _ E _ Y _
– There are N+1 places to insert, where N is the length of the string
– At each place you have 2 choices (insert or not)
– Therefore, just this subset already has 2^(N+1) members
– So, brute force is exponential!
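
The count above can be checked by enumeration. This is a small sketch, not part of the original slides; the function name `insertion_patterns` is made up for illustration. It generates every way to place at most one gap in each of the N+1 slots of MONEY.

```python
from itertools import product


def insertion_patterns(word):
    """Yield every string obtained by inserting at most one '_' in each slot of word."""
    slots = len(word) + 1  # before each letter, plus one slot after the last letter
    for choice in product((False, True), repeat=slots):
        pieces = []
        for k, insert_gap in enumerate(choice):
            if insert_gap:
                pieces.append("_")
            if k < len(word):
                pieces.append(word[k])
        yield "".join(pieces)


patterns = list(insertion_patterns("MONEY"))
print(len(patterns))  # 2 ** (5 + 1) = 64
```

Adding one letter to the word doubles the count, which is the exponential growth the slide warns about.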

Dynamic Programming
Score possibilities in an alignment matrix.
The value of any square in the matrix depends on:
– Value above (if “vertical gap”)
– Value beside, to the left (if “horizontal gap”)
– Value diagonally above (if match or mismatch)

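Global Alignment Matrix
Scoring: match +1, mismatch -1, gap -1
        M   O   N   E   Y
    0  -1  -2  -3  -4  -5
L  -1  -1  -2  -3  -4  -5
O  -2  -2   0  -1  -2  -3
V  -3  -3  -1  -1  -2  -3
E  -4  -4  -2  -2   0  -1
Traceback path: diagonal moves through L/M, O/O, V/N and E/E, then a horizontal gap for Y, giving LOVE– over MONEY with score -1.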
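
A minimal sketch of how such a global matrix can be filled in, assuming match +1, mismatch -1 and gap -1. These scores are an assumption chosen to agree with the entries legible in the matrix above; the slides do not state them, and the function name is illustrative.

```python
def global_alignment_matrix(row_seq, col_seq, match=1, mismatch=-1, gap=-1):
    """Fill a global alignment score matrix (assumed scores: +1 / -1 / -1)."""
    n, m = len(row_seq), len(col_seq)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # First column and first row: nothing but gaps so far.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = row_seq[i - 1] == col_seq[j - 1]
            diagonal = score[i - 1][j - 1] + (match if same else mismatch)
            vertical = score[i - 1][j] + gap    # value above
            horizontal = score[i][j - 1] + gap  # value beside
            score[i][j] = max(diagonal, vertical, horizontal)
    return score


matrix = global_alignment_matrix("LOVE", "MONEY")
for label, row in zip("-LOVE", matrix):
    print(label, row)
```

The bottom-right cell holds the score of the best global alignment of the two full strings.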

Local Alignment Matrix
      M  O  N  E  Y
   0  0  0  0  0  0
L  0  0  0  0  0  0
O  0  0  1  0  0  0
V  0  0  0  0  0  0
E  0  0  0  0  1  0
The two 1s (O/O and E/E) are reached by diagonal moves (matches).
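
The local version differs from the global one in two ways: the borders are 0, and no cell is allowed to drop below 0. A sketch under the same assumed scores (match +1, mismatch -1, gap -1; not stated in the slides):

```python
def local_alignment_matrix(row_seq, col_seq, match=1, mismatch=-1, gap=-1):
    """Fill a local (Smith-Waterman style) score matrix; cells never go below 0."""
    n, m = len(row_seq), len(col_seq)
    score = [[0] * (m + 1) for _ in range(n + 1)]  # borders stay 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = row_seq[i - 1] == col_seq[j - 1]
            diagonal = score[i - 1][j - 1] + (match if same else mismatch)
            vertical = score[i - 1][j] + gap
            horizontal = score[i][j - 1] + gap
            score[i][j] = max(0, diagonal, vertical, horizontal)
    return score


local = local_alignment_matrix("LOVE", "MONEY")
best = max(max(row) for row in local)
print(best)
```

The best local alignment ends at the cell with the largest value, which can be anywhere in the matrix rather than only in the bottom-right corner.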

Computing the Alignment Matrix
For each square:
– Take the best of vertical gap, horizontal gap, and (mis)match score (the minimum when counting edit costs, the maximum when using scores as in the matrices above): O(1)
There are N*M squares, where N and M are the lengths of the strings.
Therefore, time and space are both O(N*M) or (for short) O(N^2).
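
When every insertion, deletion, or change costs 1 and the minimum is taken at each square, the matrix computes the edit distance from the first slides. A minimal sketch (the function name is illustrative):

```python
def edit_distance(source, target):
    """Minimum number of insertions, deletions, and changes turning source into target."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i leading characters
    for j in range(cols):
        dist[0][j] = j  # insert all j leading characters
    for i in range(1, rows):
        for j in range(1, cols):
            change = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,           # vertical gap
                dist[i][j - 1] + 1,           # horizontal gap
                dist[i - 1][j - 1] + change,  # match or mismatch
            )
    return dist[-1][-1]


print(edit_distance("LOVE", "MONEY"))  # 3, as in the LOVE => MONEY example
```

Each square does a constant amount of work, so the two nested loops give the O(N*M) time and space stated above.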

But, what is N?
If we’re matching genomes, N is huge!
N^2 is too much time and space!
How can we save further?

Ordering the Computations
Each cell can be computed once the ones above, diagonally above, and to the left are computed:
– Left to right, top to bottom (row major)
– Top to bottom, left to right (column major)
– Across a diagonal wavefront

Saving Space: Row Major
A row major computation really only needs two rows (the one above, and the current row).
After each row is computed, the current row becomes the row above.
Savings: space is O(N) instead of O(N^2).
Cost: insufficient information for traceback.
– Do a new alignment, limited to a region around the result.
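
The two-row idea can be sketched as follows, again assuming match +1, mismatch -1 and gap -1 (illustrative values, not given in the slides). Only the final score is returned, because the rows needed for traceback are discarded along the way.

```python
def global_score_two_rows(row_seq, col_seq, match=1, mismatch=-1, gap=-1):
    """Global alignment score using two rows of storage instead of the full matrix."""
    previous = [j * gap for j in range(len(col_seq) + 1)]  # row 0
    for i in range(1, len(row_seq) + 1):
        current = [i * gap] + [0] * len(col_seq)
        for j in range(1, len(col_seq) + 1):
            same = row_seq[i - 1] == col_seq[j - 1]
            diagonal = previous[j - 1] + (match if same else mismatch)
            vertical = previous[j] + gap
            horizontal = current[j - 1] + gap
            current[j] = max(diagonal, vertical, horizontal)
        previous = current  # the current row becomes the row above
    return previous[-1]


print(global_score_two_rows("LOVE", "MONEY"))
```

Putting the shorter string along the columns keeps the two rows as small as possible.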

Saving Time: Wavefront
Use a parallel processor (effectively N machines at a time).
Each reverse diagonal is computed at once.
Time is now O(N), but the cost is N processors instead of 1.
A computer science theoretician would say “no savings”, but if you’re the one waiting, you might disagree!
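
The sketch below shows only the ordering, not real parallelism: it groups the cells into reverse diagonals and fills them one diagonal at a time. Cells within one diagonal do not depend on each other, so a parallel machine could compute them simultaneously. The function names and the scores (match +1, mismatch -1, gap -1) are assumptions for illustration.

```python
def antidiagonals(n_rows, n_cols):
    """Yield lists of cells (i, j), 1-based, that share the same i + j."""
    for total in range(2, n_rows + n_cols + 1):
        first_row = max(1, total - n_cols)
        last_row = min(n_rows, total - 1)
        yield [(i, total - i) for i in range(first_row, last_row + 1)]


def global_score_by_wavefront(row_seq, col_seq, match=1, mismatch=-1, gap=-1):
    """Fill the global matrix one reverse diagonal at a time."""
    n, m = len(row_seq), len(col_seq)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        score[i][0] = i * gap
    for j in range(m + 1):
        score[0][j] = j * gap
    for wave in antidiagonals(n, m):
        for i, j in wave:  # independent cells: could run on separate processors
            same = row_seq[i - 1] == col_seq[j - 1]
            diagonal = score[i - 1][j - 1] + (match if same else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[n][m]


waves = list(antidiagonals(4, 5))
print(len(waves), global_score_by_wavefront("LOVE", "MONEY"))
```

For strings of lengths N and M there are N+M-1 wavefronts, which is where the O(N) parallel time comes from.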

Saving More Time: Partial Search
In local alignment, large areas have 0’s.
Mismatches adjacent to 0’s are also 0’s.
To get “reasonably large” values, you need longer sequences (BLAST “words”) in common.
So, only search near where there are common subsequences.

Finding Common Subsequences
Pick a sequence length.
For each subsequence of that length, find all occurrences in each sequence.
If i is the index in one sequence and j is the index in the other sequence, then fill in the region of the alignment matrix near (i, j).
(i, j) is called the seed.
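
A minimal sketch of exact-match seeding: index every word of the chosen length in one sequence, then look up each word of the other. The function name `find_seeds` and the example strings are made up for illustration.

```python
from collections import defaultdict


def find_seeds(seq_a, seq_b, word_length):
    """Return (i, j) pairs where seq_a[i:] and seq_b[j:] share a word of word_length."""
    positions = defaultdict(list)
    for j in range(len(seq_b) - word_length + 1):
        positions[seq_b[j:j + word_length]].append(j)
    seeds = []
    for i in range(len(seq_a) - word_length + 1):
        word = seq_a[i:i + word_length]
        for j in positions.get(word, []):
            seeds.append((i, j))
    return seeds


print(find_seeds("GATTACA", "TTAC", 3))  # [(2, 0), (3, 1)]
```

Only the regions of the alignment matrix near these (i, j) pairs then need to be filled in.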

BLAST’s Generalization
Consider a threshold T and a sequence S.
The neighborhood of the sequence S is the set of all sequences of the same length that score T or better against S.
BLAST uses neighborhoods to set seeds (areas of the alignment matrix to explore).
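
A sketch of the neighborhood definition, enumerating every word of the same length and keeping those that reach the threshold. The scoring function here is a simple +1/-1 stand-in over a DNA alphabet, an assumption for illustration; for protein words a substitution matrix such as BLOSUM62 would be passed in as `score` instead.

```python
from itertools import product


def neighborhood(word, threshold, alphabet, score):
    """All words of len(word) over alphabet scoring at least threshold against word."""
    result = []
    for letters in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, letters))
        if total >= threshold:
            result.append("".join(letters))
    return result


def simple_score(a, b):
    """Stand-in scoring (assumption): +1 for identical letters, -1 otherwise."""
    return 1 if a == b else -1


print(len(neighborhood("ACG", 1, "ACGT", simple_score)))  # 10: exact word plus 9 single mismatches
print(neighborhood("ACG", 3, "ACGT", simple_score))       # ['ACG']
```

Raising the threshold shrinks the neighborhood, which is the speed versus sensitivity tradeoff discussed next.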

Consequences of Choices
Higher values of T are faster, but ignore more potential matches.
Longer sequences are less common:
– Smaller neighborhoods for a given T
– Fewer areas to search
– More likelihood of missing good alignments

T vs. Sequence Size
Longer sequences have higher maximum scores (unless normalized).
But, longer sequences (tend to) have more likelihood of mismatches?

Too Many Seeds
If we pick a sequence length and threshold that are sufficiently sensitive, we still might have too many seeds for reasonable alignment times.
Two-seed solution:
– Only consider areas of the table that contain two seeds (on the same diagonal) separated by a limited distance
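
A simplified sketch of the two-seed filter: seeds are grouped by diagonal (i - j), and only neighboring seeds on the same diagonal that lie within a chosen distance are kept. The function name and the `max_distance` parameter are illustrative, and details such as requiring non-overlapping seeds are left out.

```python
from collections import defaultdict


def two_hit_pairs(seeds, max_distance):
    """Return pairs of seeds on the same diagonal at most max_distance apart."""
    by_diagonal = defaultdict(list)
    for i, j in seeds:
        by_diagonal[i - j].append(i)
    pairs = []
    for diagonal, starts in by_diagonal.items():
        starts.sort()
        for first, second in zip(starts, starts[1:]):
            if 0 < second - first <= max_distance:
                pairs.append(((first, first - diagonal), (second, second - diagonal)))
    return pairs


seeds = [(2, 0), (5, 3), (20, 18), (9, 1)]
print(two_hit_pairs(seeds, 5))  # [((2, 0), (5, 3))]
```

The isolated seeds (20, 18) and (9, 1) are dropped, so no alignment time is spent around them.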

Extending Alignments
A seed region is a small alignment.
We want to “grow” the alignments (especially if we can connect to others!).
To grow an alignment, use Smith-Waterman to compute neighboring values.
Question: when to stop growing?

Score Changes During Growing
As an alignment is extended, its score changes:
– Score increases when sub-matches connect
– Score decreases when extended into an unrelated area
Often the score must decrease before increasing!

When to Stop?
Consider the current score, compared to the maximum score so far.
When the current score gets sufficiently small relative to the maximum, then stop.
This is another parameter with a tradeoff (stop too soon and get smaller results, stop too late and do useless work).
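
The stopping rule can be sketched on a simplified case: extending to the right along one diagonal without gaps, rather than the full Smith-Waterman growth described earlier. Extension stops once the running score has fallen `drop_limit` below the best score seen. The names, the +1/-1 scores and the ungapped simplification are all assumptions for illustration.

```python
def extend_right(seq_a, seq_b, i, j, drop_limit, match=1, mismatch=-1):
    """Extend an ungapped alignment from (i, j); return (best_score, best_length)."""
    score = best_score = 0
    length = best_length = 0
    while i + length < len(seq_a) and j + length < len(seq_b):
        same = seq_a[i + length] == seq_b[j + length]
        score += match if same else mismatch
        length += 1
        if score > best_score:
            best_score, best_length = score, length
        elif best_score - score >= drop_limit:
            break  # fallen too far below the maximum so far
    return best_score, best_length


print(extend_right("AAXAA", "AAYAA", 0, 0, drop_limit=2))  # (3, 5): survives the dip
print(extend_right("AAXAA", "AAYAA", 0, 0, drop_limit=1))  # (2, 2): stops too soon
```

The two calls show the tradeoff: the tighter limit gives up at the single mismatch and misses the longer, higher-scoring alignment.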

One more “trick”
Suppose that there is a “standard” sequence that many people want to align against.
Run the seeding algorithm with different sequence lengths and thresholds and save the resulting seed locations.
When someone does a search, the seeding part has already been done.

Offline vs. Online Algorithms
Offline Algorithm
– Execute the “standardized” part of the algorithm in advance, and save the result
– This is like compilation of a program
Online Algorithm
– Use the tables or databases you built offline to answer a specific query
– This is like running a program
– The user sees only the time taken by the Online Algorithm
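
Applied to seeding, the split might look like the sketch below: the word index of the standard sequence is built once, and each query only performs lookups. The function names and example strings are hypothetical; a real system would store the index on disk rather than in a Python dictionary.

```python
from collections import defaultdict


def build_index(reference, word_length):
    """Offline step: map every word in the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - word_length + 1):
        index[reference[pos:pos + word_length]].append(pos)
    return dict(index)


def query_seeds(index, query, word_length):
    """Online step: look up each word of the query in the prebuilt index."""
    seeds = []
    for i in range(len(query) - word_length + 1):
        for pos in index.get(query[i:i + word_length], []):
            seeds.append((i, pos))
    return seeds


index = build_index("GATTACAGATT", 3)  # done once, in advance
print(query_seeds(index, "ATTA", 3))   # [(0, 1), (0, 8), (1, 2)]
```

The cost of `build_index` is paid once, so the user only waits for `query_seeds`.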

Common Offline/Online Applications
Web searching
– Offline: build indexes of sites vs. keywords
– Online: retrieve sites from the index
Neural networks
– Offline: train the network on many examples of the problem, set the weights
– Online: run the network once (with fixed weights) on the specific example

Summary
Smith-Waterman is exact, accurate, and time-consuming (even though it uses dynamic programming to get down to O(N^2)).
BLAST speeds up the search process, but is no longer exact, so it can miss good alignments (even the best one!).

Using BLAST Well
Importance of setting parameters:
– Sequence length
– Score threshold
– Distance (for two-hit method)
– Stopping condition (for growing seeded alignments)

Exercises
Given the BLOSUM62 matrix at http://www.ncbi.nlm.nih.gov/Class/BLAST/BLOSUM62.txt
– What is the neighborhood of HID with threshold 5? 10? 15?
Create two random sequences of 20 bases each (flip two coins for each base: HH=A, TT=T, HT=C, TH=G)
