# Sequence Alignment Bioinformatics. Sequence Comparison Problem: Given two sequences S & T, are S and T similar? Need to establish some notion of similarity.

Sequence Alignment Bioinformatics

## Sequence Comparison

Problem: given two sequences S and T, are S and T similar?

- Need to establish some notion of similarity:
  - Edit distance (transforming S into T)
  - Scoring mechanism
- Related problem: given a target sequence, obtain the sequences in a database that are similar to the target

## Edit Distance

- Sequences S and T are strings over an alphabet (e.g., {a, c, t, g})
- Edit operations (indels):
  - Insertion of a character
  - Deletion of a character
- Example: 3 indels are needed to transform attc into tttac
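The indel count in the example can be checked with a small dynamic program. This is a minimal sketch, not part of the original slides: the function name `indel_distance` is an assumption, and it counts insertions and deletions only (no substitutions), matching the edit operations listed on this slide.

```python
def indel_distance(s: str, t: str) -> int:
    """Minimum number of single-character insertions and deletions
    needed to transform s into t (substitutions are not allowed)."""
    # dist[i][j] = indels needed to turn the prefix s[:i] into the prefix t[:j]
    dist = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        dist[i][0] = i  # delete all i characters of s[:i]
    for j in range(len(t) + 1):
        dist[0][j] = j  # insert all j characters of t[:j]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                # Equal characters can be kept at no cost
                dist[i][j] = dist[i - 1][j - 1]
            else:
                # Otherwise delete s[i-1] or insert t[j-1]
                dist[i][j] = 1 + min(dist[i - 1][j], dist[i][j - 1])
    return dist[len(s)][len(t)]


print(indel_distance("attc", "tttac"))  # 3
```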

## Alignment

- We can model edit distance by aligning the two strings, e.g. `-att-c` over `t-ttac`
- An alignment of strings S and T is described by two strings S’ and T’ of the same length such that:
  - S’ (respectively T’) contains the characters of S (respectively T) in order, interspersed with spaces (written `-`)
  - No position contains a space in both S’ and T’
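The conditions in this definition can be checked mechanically. The sketch below is illustrative only; the names `is_valid_alignment` and `count_indels` are assumptions, not from the slides.

```python
GAP = "-"


def is_valid_alignment(s: str, t: str, s_aln: str, t_aln: str) -> bool:
    """Check that (s_aln, t_aln) is an alignment of s and t."""
    # Both aligned strings must have the same length
    if len(s_aln) != len(t_aln):
        return False
    # Removing the spaces must give back the original strings, in order
    if s_aln.replace(GAP, "") != s or t_aln.replace(GAP, "") != t:
        return False
    # No position may hold a space in both aligned strings
    return all(not (a == GAP and b == GAP) for a, b in zip(s_aln, t_aln))


def count_indels(s_aln: str, t_aln: str) -> int:
    """Number of positions where exactly one of the two strings has a space."""
    return sum((a == GAP) != (b == GAP) for a, b in zip(s_aln, t_aln))


print(is_valid_alignment("attc", "tttac", "-att-c", "t-ttac"))  # True
print(count_indels("-att-c", "t-ttac"))  # 3
```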

## Gaps, Matches, and Mismatches

When comparing the characters at the same position in S’ and T’, four possibilities arise:

- `-` in S’: insertion (gap)
- `-` in T’: deletion (gap)
- Characters match: match
- Characters don’t match: mismatch

Weights can be assigned to each possibility (usually a positive number for matches and a negative number for gaps and mismatches).

## Scoring and Optimal Alignments

- Given strings S and T and an alignment (S’, T’), a score can be computed from pre-established weights for gaps, matches, and mismatches: add up the weights over all positions of S’ and T’
- Note that there are many possible alignments for S and T
- An optimal alignment for S and T is an alignment that yields the maximum score
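Scoring a given alignment is a single pass over its positions. The sketch below is an illustration under assumptions: the function name `score_alignment` is hypothetical, and the default weights (5 for a match, -2 for a mismatch or gap) are borrowed from the edit-graph example later in this transcript rather than prescribed here.

```python
def score_alignment(s_aln: str, t_aln: str,
                    match: int = 5, mismatch: int = -2, gap: int = -2) -> int:
    """Sum the weights over all positions of an alignment (s_aln, t_aln)."""
    if len(s_aln) != len(t_aln):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for a, b in zip(s_aln, t_aln):
        if a == "-" or b == "-":
            total += gap        # insertion or deletion
        elif a == b:
            total += match      # identical characters
        else:
            total += mismatch   # different characters
    return total


# Three matches and three gaps: 3 * 5 + 3 * (-2) = 9
print(score_alignment("-att-c", "t-ttac"))  # 9
```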

## Problem Formulations for Sequence Comparison

- Original formulation: given two sequences S and T, are S and T similar?
- Revised formulation: given two sequences S and T, and weights for matches, gaps, and mismatches, determine the score of an optimal alignment of S and T

## Brute-force Algorithm

Compare(S, T):

1. Generate all possible alignments for S and T
2. For each alignment, determine its score
3. Return the maximum score

Note: this algorithm takes exponential time, because the number of possible alignments of S and T grows exponentially with their lengths.
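The brute-force procedure can be sketched directly, which also makes the growth in the number of alignments easy to observe. The names `all_alignments` and `brute_force_score` and the default weights (5, -2, -2) are assumptions for illustration. Alignments that differ only in the order of adjacent gap columns are counted separately here.

```python
def all_alignments(s: str, t: str):
    """Yield every alignment (s_aln, t_aln) of s and t."""
    if not s and not t:
        yield "", ""
        return
    if s and t:  # align the first characters with each other
        for a, b in all_alignments(s[1:], t[1:]):
            yield s[0] + a, t[0] + b
    if s:        # align the first character of s with a space
        for a, b in all_alignments(s[1:], t):
            yield s[0] + a, "-" + b
    if t:        # align the first character of t with a space
        for a, b in all_alignments(s, t[1:]):
            yield "-" + a, t[0] + b


def brute_force_score(s: str, t: str,
                      match: int = 5, mismatch: int = -2, gap: int = -2) -> int:
    """Score every alignment and return the maximum."""
    best = None
    for s_aln, t_aln in all_alignments(s, t):
        score = 0
        for a, b in zip(s_aln, t_aln):
            if a == "-" or b == "-":
                score += gap
            elif a == b:
                score += match
            else:
                score += mismatch
        if best is None or score > best:
            best = score
    return best


# The number of alignments grows quickly with the string length
for n in range(1, 6):
    print(n, sum(1 for _ in all_alignments("a" * n, "a" * n)))
```

Even for two strings of length 2 there are already 13 alignments to score, so this approach is only practical for very short inputs.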

## An Edit Graph

[Figure: edit graph with TGCATA along the top (columns) and ATCTGAT down the left side (rows)]

## Edit Graphs are Alignments

- A path from the upper left corner to the lower right corner represents an alignment:
  - Vertical arrow: gap (deletion)
  - Horizontal arrow: gap (insertion)
  - Diagonal arrow: match or mismatch
- Alignment: `AT-C-TGAT` over `-TGCAT-A-`
- Score (assuming 5 for a match and -2 for a gap or mismatch): -2 + 5 - 2 + 5 - 2 + 5 - 2 + 5 - 2 = 10
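The correspondence between paths and alignments can be made concrete by replaying a path. This is a sketch under assumptions: the move letters `V`, `H`, `D` (vertical, horizontal, diagonal) and the function name `path_to_alignment` are invented for illustration, with S = ATCTGAT read down the rows and T = TGCATA read across the columns.

```python
def path_to_alignment(s: str, t: str, moves: str):
    """Turn a path through the edit graph into an alignment (s_aln, t_aln).

    'V' consumes a character of s (gap in t, a deletion),
    'H' consumes a character of t (gap in s, an insertion),
    'D' consumes one character of each (match or mismatch).
    """
    s_aln, t_aln = [], []
    i = j = 0
    for move in moves:
        if move == "D":
            s_aln.append(s[i])
            t_aln.append(t[j])
            i += 1
            j += 1
        elif move == "V":
            s_aln.append(s[i])
            t_aln.append("-")
            i += 1
        elif move == "H":
            s_aln.append("-")
            t_aln.append(t[j])
            j += 1
        else:
            raise ValueError(f"unknown move: {move!r}")
    if i != len(s) or j != len(t):
        raise ValueError("path does not end at the lower right corner")
    return "".join(s_aln), "".join(t_aln)


print(path_to_alignment("ATCTGAT", "TGCATA", "VDHDHDVDV"))
# ('AT-C-TGAT', '-TGCAT-A-')
```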

## Entries in an Edit Graph

- Strategy: fill in the intersections (green circles) with running scores based on the path traversed so far
- Each entry x can be computed from at most three other values: a, its diagonal neighbor, and b and c, its other two neighbors (to the left and above)

$$
x = \text{one of } \begin{cases} a + \text{match/mismatch weight} \\ b + \text{gap weight} \\ c + \text{gap weight} \end{cases}
$$

## Dynamic Programming Algorithm

1. Start with the upper left corner (score 0)
2. Fill in the top row and the leftmost column
3. Fill in the succeeding rows using the formula
4. The resulting value in the lower right corner is the optimal score

$$
x = \max \begin{cases} a + \text{match/mismatch weight} \\ b + \text{gap weight} \\ c + \text{gap weight} \end{cases}
$$
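The four steps translate almost line for line into code. The following is a minimal sketch, assuming a single gap weight and the example weights 5 / -2 / -2; the function name `optimal_score` is hypothetical.

```python
def optimal_score(s: str, t: str,
                  match: int = 5, mismatch: int = -2, gap: int = -2) -> int:
    """Score of an optimal alignment of s and t, computed by filling the table."""
    rows, cols = len(s) + 1, len(t) + 1
    # Upper left corner starts at score 0
    table = [[0] * cols for _ in range(rows)]
    # Leftmost column: only vertical moves (gaps) are possible
    for i in range(1, rows):
        table[i][0] = table[i - 1][0] + gap
    # Top row: only horizontal moves (gaps) are possible
    for j in range(1, cols):
        table[0][j] = table[0][j - 1] + gap
    # Succeeding rows: maximum over the three neighbouring entries
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + pair,  # diagonal: match or mismatch
                table[i - 1][j] + gap,       # from above: gap
                table[i][j - 1] + gap,       # from the left: gap
            )
    # Lower right corner holds the optimal score
    return table[-1][-1]


print(optimal_score("attc", "tttac"))  # 11
```

With these weights the best alignment of attc and tttac uses one mismatch, three matches, and one gap, giving -2 + 5 + 5 - 2 + 5 = 11.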

## Algorithm Analysis

- Let N be the length of both S and T
- Need to compute (N+1)(N+1) entries, each in constant time from at most three neighboring entries
- This gives an $O(N^2)$ algorithm

## Determining the Actual Alignment

- Need to remember which neighbor contributed to the computation of each entry (which candidate value was the maximum)
- Perform a back-trace from the lower right corner to the upper left corner
- Multiple optimal alignments are possible because of ties
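One way to implement the back-trace is to store, next to each score, the move that produced it. The sketch below is an illustration under assumptions: the name `align`, the move letters `D`/`V`/`H`, and the weights 5 / -2 / -2 are not from the slides, and ties are broken in favor of the diagonal move, so only one of possibly several optimal alignments is returned.

```python
def align(s: str, t: str, match: int = 5, mismatch: int = -2, gap: int = -2):
    """Return (score, s_aln, t_aln) for one optimal alignment of s and t."""
    rows, cols = len(s) + 1, len(t) + 1
    score = [[0] * cols for _ in range(rows)]
    move = [[""] * cols for _ in range(rows)]  # which neighbour gave the maximum
    for i in range(1, rows):
        score[i][0], move[i][0] = i * gap, "V"
    for j in range(1, cols):
        score[0][j], move[0][j] = j * gap, "H"
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            candidates = [
                (score[i - 1][j - 1] + pair, "D"),
                (score[i - 1][j] + gap, "V"),
                (score[i][j - 1] + gap, "H"),
            ]
            # max() keeps the first maximum, so ties prefer the diagonal
            score[i][j], move[i][j] = max(candidates, key=lambda c: c[0])

    # Back-trace from the lower right corner to the upper left corner
    i, j = len(s), len(t)
    s_aln, t_aln = [], []
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "D":
            s_aln.append(s[i - 1])
            t_aln.append(t[j - 1])
            i, j = i - 1, j - 1
        elif step == "V":
            s_aln.append(s[i - 1])
            t_aln.append("-")
            i -= 1
        else:  # "H"
            s_aln.append("-")
            t_aln.append(t[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(s_aln)), "".join(reversed(t_aln))


best, s_aln, t_aln = align("attc", "tttac")
print(best)
print(s_aln)
print(t_aln)
```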

## Other Complexity Issues

- When searching a database, the time complexity depends on the size D of the database, since the algorithm is run on each sequence in the database: $O(DN^2)$
- Space requirement: an (N+1)(N+1) table
- Space can be improved to 4N by filling in the table in “inverted Ls”: the topmost row and leftmost column first, then the next inner row and column, one stage at a time
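The sketch below illustrates both points: a database search that runs the algorithm once per sequence, and a score computation that uses linear space. Note that it swaps in a simpler variant than the “inverted L” order described above: it keeps only the previous and current rows, which also needs space linear in N. Like any version that discards the full table, it returns only the score and cannot back-trace. The names `optimal_score_two_rows` and `search_database` and the toy database are assumptions.

```python
def optimal_score_two_rows(s: str, t: str,
                           match: int = 5, mismatch: int = -2, gap: int = -2) -> int:
    """Optimal alignment score, keeping only two rows of the table."""
    prev = [j * gap for j in range(len(t) + 1)]  # top row
    for i in range(1, len(s) + 1):
        curr = [i * gap] + [0] * len(t)          # leftmost entry of row i
        for j in range(1, len(t) + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + pair,    # diagonal
                          prev[j] + gap,         # from above
                          curr[j - 1] + gap)     # from the left
        prev = curr
    return prev[-1]


def search_database(target: str, database, **weights):
    """Score the target against every database sequence, best first."""
    scored = [(optimal_score_two_rows(target, seq, **weights), seq)
              for seq in database]
    return sorted(scored, reverse=True)


# Hypothetical toy database
for score, seq in search_database("attc", ["tttac", "gggg", "attc"]):
    print(score, seq)
```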

## Variations

- The scoring mechanism is driven by the weights for gaps, matches, and mismatches
- Can have different weights for starting a gap versus extending a gap (e.g., blastp and blastn)
- Can have a table that allows different match/mismatch scores for different character pairs (e.g., BLOSUM)
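Both variations can be illustrated by scoring a given alignment. Everything specific in this sketch is an assumption: the pair scores are made-up toy values (not BLOSUM entries), the penalties -5 and -1 are arbitrary, and the convention that the first space of a gap costs `gap_open` and each further space costs `gap_extend` is only one of several in use.

```python
# Toy pair-score table with invented values, NOT taken from BLOSUM:
# a/g and c/t mismatches are penalised less than other mismatches.
TOY_MISMATCH_SCORES = {frozenset("ag"): -1, frozenset("ct"): -1}


def toy_pair_score(a: str, b: str) -> int:
    """Look up the score for aligning character a with character b."""
    if a == b:
        return 4
    return TOY_MISMATCH_SCORES.get(frozenset((a, b)), -3)


def score_with_affine_gaps(s_aln: str, t_aln: str, pair_score=toy_pair_score,
                           gap_open: int = -5, gap_extend: int = -1) -> int:
    """Score an alignment with a pair-score table and open/extend gap weights."""
    total = 0
    gap_side = None  # "s" or "t" while inside a run of spaces, else None
    for a, b in zip(s_aln, t_aln):
        if a == "-":
            total += gap_extend if gap_side == "s" else gap_open
            gap_side = "s"
        elif b == "-":
            total += gap_extend if gap_side == "t" else gap_open
            gap_side = "t"
        else:
            total += pair_score(a, b)
            gap_side = None
    return total


# a/g: -1, c/c: 4, gap opened: -5, gap extended: -1, t/t: 4
print(score_with_affine_gaps("ac--t", "gcggt"))  # 1
```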
