# Heuristic alignment algorithms and cost matrices

Heuristic alignment algorithms and cost matrices
Linda Muselaars and Miranda Stobbe

Overview chapter 2
- What sorts of alignment should be considered? Local, global, repeated matches, and overlap matches.
- The scoring system used to rank alignments: a scoring model that determines the score of an alignment, so that the optimal one can be found.
- The algorithm used to find optimal (or good) scoring alignments: the Needleman-Wunsch algorithm and the Smith-Waterman algorithm.
- The statistical methods used to evaluate the significance of an alignment score.

Contents
- Heuristic alignment algorithms
  - BLAST
  - FASTA
- Linear space methods
- Significance of scores
  - Bayesian approach
  - Classical approach
- Deriving score parameters
  - PAM matrices
  - BLOSUM

Dynamic programming has a time complexity proportional to the product of the lengths of the two sequences under study, which is already substantial for sequences about one thousand residues long. The goal of heuristic methods is to search as small a fraction as possible of the cells in the dynamic programming matrix, while still looking at all the high scoring alignments. Their running time is close to linear.

The term heuristic
A heuristic algorithm is based on empirical information that has no explicit rationalization. It does not necessarily return the exact answer to the problem under study, but it is faster than the algorithm that does and is still very usable.

BLAST
- Basic Local Alignment Search Tool.
- Simplification of the Smith-Waterman algorithm.
- Uses fixed-length subsequences (words) of the query sequence to make ‘neighbourhood words’: all words that score above a threshold when compared with a query word. The default word length is 3 for proteins and 11 for nucleic acids.
- When a neighbourhood word matches a subsequence in the database, a ‘hit extension’ process is started: the possible match is extended as an ungapped alignment in both directions, stopping at the maximum scoring extension.

The idea behind the BLAST algorithm is that true match alignments are very likely to contain somewhere within them a short stretch of identities, or very high scoring matches. The BLAST programs have been designed for speed, with a minimal sacrifice of sensitivity to distant sequence relationships. The scores assigned in a BLAST search have a well-defined statistical interpretation, making real matches easier to distinguish from random background hits.

In practice, the query and database sequences are divided into overlapping words, and the word pairs that give a high score when compared are identified. A hash table is used for this lookup; for proteins it also includes the conserved replacements, not only the exact query words.
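
The hit extension step can be sketched as follows. This is a minimal illustration, not the BLAST implementation: the function names and the match/mismatch scores are hypothetical, and the sketch scans to the end of the sequences and keeps the best-scoring extension, whereas the real programs stop early once the score drops too far below the best value seen.

```python
def score_pair(a, b, match=5, mismatch=-4):
    """Toy substitution score (hypothetical values, not a real matrix)."""
    return match if a == b else mismatch


def best_extension(query, target, qi, ti, step):
    """Walk from (qi, ti) in direction `step` (+1 or -1) without gaps.

    Returns the best cumulative score reached and how many residues
    had been added when that best score was reached.
    """
    best = total = 0
    best_len = length = 0
    while 0 <= qi < len(query) and 0 <= ti < len(target):
        total += score_pair(query[qi], target[ti])
        length += 1
        if total > best:
            best, best_len = total, length
        qi += step
        ti += step
    return best, best_len


def extend_hit(query, target, qpos, tpos, word_len):
    """Extend a word hit in both directions; return (score, q_start, q_end)."""
    seed = sum(score_pair(query[qpos + k], target[tpos + k])
               for k in range(word_len))
    right, right_len = best_extension(query, target,
                                      qpos + word_len, tpos + word_len, +1)
    left, left_len = best_extension(query, target, qpos - 1, tpos - 1, -1)
    return seed + left + right, qpos - left_len, qpos + word_len + right_len


if __name__ == "__main__":
    # The word "TTG" matches at position 3 in both toy sequences.
    print(extend_hit("AAGTTGCA", "CCGTTGCT", 3, 3, 3))
```

The returned interval is half-open, so `query[q_start:q_end]` is the query side of the extended ungapped segment.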

Example
- Query sequence: q l n f
- All subsequences: q l, l n, n f
- Creating neighbourhood words:
  - q l → q l, q m, h l, z l
  - l n → l n, l b
  - n f → n f, a f, n y, d f, q f, e f, g f, h f, k f, s f, t f, b f, z f
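
Generating neighbourhood words amounts to scoring every candidate word against the query word and keeping those at or above the threshold. The sketch below assumes a toy scoring scheme: the residue grouping, the score values, and the function names are all hypothetical stand-ins for a real substitution matrix, so it will not reproduce the word lists on the slide.

```python
from itertools import product

# Hypothetical grouping of "similar" amino acids, for illustration only.
SIMILAR_GROUPS = ["ILMV", "FWY", "KRH", "DE", "NQ", "ST", "AG", "C", "P"]
ALPHABET = "".join(SIMILAR_GROUPS)  # the 20 standard amino acids


def toy_score(a, b):
    """Identity scores 5, a conserved replacement 2, anything else -3."""
    if a == b:
        return 5
    if any(a in group and b in group for group in SIMILAR_GROUPS):
        return 2
    return -3


def neighbourhood(word, threshold):
    """All words of the same length scoring at least `threshold` against `word`."""
    hits = []
    for candidate in product(ALPHABET, repeat=len(word)):
        total = sum(toy_score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            hits.append("".join(candidate))
    return sorted(hits)


if __name__ == "__main__":
    print(neighbourhood("NF", 7))
```

Enumerating all candidates is only practical for short words (20^3 = 8000 candidates for protein words of length 3), which is one reason the word length is kept small.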

FASTA
- FAST-All. Fast approximation of the Smith-Waterman algorithm.
- Step 1: find exact short word matches of length ktup. As each sequence is read from the database, divide it into overlapping words with the same ktup as the query sequence. Compare the words from the database sequence and the query sequence and compute an initial score based on the number of identities concentrated within small sequential regions. This can be done faster with a hash or look-up table.
- Step 2: extend to ungapped alignments. Save the 10 regions with the highest initial score and re-score them with the PAM250 matrix. Trim the ends of a region if that gives a higher score. The resulting score is called the init1 score.
- Step 3: identify gapped alignments. Try to join some of the (ungapped) regions found in Step 2 to make a better score, even with gap penalties applied. The score is called the initn score.
- Step 4: dynamic programming restricted to a subregion. Use the full dynamic programming algorithm on a narrow band (32 residues wide) around the best matches from Step 3. The score is called the opt score.
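
Step 1 can be sketched with a look-up table of query words and a count of word hits per diagonal, since hits on the same diagonal belong to the same ungapped region. This is a simplified illustration under that assumption; the function names are hypothetical and the scoring and region selection of the real program are not reproduced.

```python
from collections import Counter, defaultdict


def build_lookup(query, ktup):
    """Map every word of length ktup in the query to its start positions."""
    table = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        table[query[i:i + ktup]].append(i)
    return table


def diagonal_hits(query, target, ktup):
    """Count exact word matches on each diagonal (query pos minus target pos)."""
    table = build_lookup(query, ktup)
    diagonals = Counter()
    for j in range(len(target) - ktup + 1):
        for i in table.get(target[j:j + ktup], ()):
            diagonals[i - j] += 1
    return diagonals


if __name__ == "__main__":
    hits = diagonal_hits("GGACGTCC", "TACGTA", 2)
    # Diagonals with many hits are the candidate regions for re-scoring.
    print(hits.most_common(3))
```

Building the table once per query means each database sequence is scanned in time roughly proportional to its length.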

BLAST versus FASTA
- They both use the same extension method.
- They both can be used for both DNA and proteins.
- BLAST is faster than FASTA.
- BLAST uses neighbourhood words, FASTA does not.
- BLAST is mainly for ungapped alignment, FASTA for gapped alignments.
- BLAST is more sensitive than FASTA on proteins.
- FASTA is more sensitive than BLAST for nucleic acid sequences.

Why is BLAST more sensitive on proteins? Because FASTA accepts only perfect word matches, whereas BLAST accepts conserved replacements in protein sequences.

BLAST vs. FASTA, example
- Consider the sequences n f l and n y l, with ktup = 2 (remember: ktup applies only to FASTA).
- Even though FASTA only needs a matching word of size 2, it does not find a match: the words n f and f l have nothing in common with n y and y l.
- BLAST does find a match (of word size 3, even) on account of neighbourhood words.
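
The FASTA half of this example is easy to check mechanically. A minimal sketch, where `words` and `shared_words` are hypothetical helper names:

```python
def words(seq, k):
    """Set of all overlapping words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_words(a, b, k):
    """Words of length k that occur exactly in both sequences."""
    return words(a, k) & words(b, k)


if __name__ == "__main__":
    # No exact word of length 2 is shared, so there is no seed to extend.
    print(shared_words("nfl", "nyl", 2))
```

Whether BLAST finds the match depends on the substitution scores and the threshold, since n y l must fall within the neighbourhood of n f l.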

Demo at HLA class I histocompatibility antigen, A-1 alpha chain precursor

Another computational resource that can limit dynamic programming alignment is memory usage.

Reducing memory usage
- Score matrices so far are of size nm (with n and m the sequence lengths).
- We can reduce memory usage to n+m.
- Cost: time is doubled.
- This is done by linear space methods.

A matrix of overall size nm is manageable for short sequences, but it becomes a problem if one or both of the sequences is a DNA sequence tens or hundreds of thousands of bases long.
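
If only the optimal score is needed, each row of the matrix depends on the previous row alone, so two rows suffice. A minimal sketch for global alignment with a linear gap penalty; the function name and the scoring values are assumptions chosen for illustration:

```python
def nw_score_last_row(x, y, match=1, mismatch=-1, gap=-2):
    """Last row of the global alignment score matrix, using O(len(y)) memory."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [i * gap] + [0] * len(y)
        for j in range(1, len(y) + 1):
            pair = match if x[i - 1] == y[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + pair,  # align x[i-1] with y[j-1]
                          prev[j] + gap,       # gap in y
                          curr[j - 1] + gap)   # gap in x
        prev = curr
    return prev


if __name__ == "__main__":
    # The final entry of the last row is the optimal global score.
    print(nw_score_last_row("GATTACA", "GATTTACA")[-1])
```

This recovers the score but not the alignment itself, because the traceback pointers are discarded along with the earlier rows.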

Divide and conquer
- We find a cell (u,v) in the middle column that is on the optimal path.
- This cell divides the matrix into four parts, of which only two (the upper-left and the lower-right) are important for the path.
- The same procedure is applied recursively to these two parts.
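
A compact sketch of this recursion, in the style usually attributed to Hirschberg, is given below. It assumes a linear gap penalty and toy scoring constants, and all names are illustrative. The first sequence is split in the middle, and the split point in the second sequence is the position where the forward score of the first half plus the backward score of the second half is largest.

```python
MATCH, MISMATCH, GAP = 1, -1, -2  # hypothetical scoring parameters


def last_row(x, y):
    """Last row of the global alignment score matrix in linear space."""
    prev = [j * GAP for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [i * GAP] + [0] * len(y)
        for j in range(1, len(y) + 1):
            pair = MATCH if x[i - 1] == y[j - 1] else MISMATCH
            curr[j] = max(prev[j - 1] + pair, prev[j] + GAP, curr[j - 1] + GAP)
        prev = curr
    return prev


def align_single(ch, y):
    """Optimal alignment of a single residue against a non-empty sequence."""
    j = max(range(len(y)), key=lambda k: MATCH if y[k] == ch else MISMATCH)
    pair = MATCH if y[j] == ch else MISMATCH
    if pair + (len(y) - 1) * GAP >= (len(y) + 1) * GAP:
        return "-" * j + ch + "-" * (len(y) - j - 1), y
    return ch + "-" * len(y), "-" + y


def hirschberg(x, y):
    """Global alignment of x and y using memory linear in the lengths."""
    if not x:
        return "-" * len(y), y
    if not y:
        return x, "-" * len(x)
    if len(x) == 1:
        return align_single(x, y)
    mid = len(x) // 2
    forward = last_row(x[:mid], y)
    backward = last_row(x[mid:][::-1], y[::-1])
    # Split y where the two half-scores add up to the optimal total.
    split = max(range(len(y) + 1),
                key=lambda j: forward[j] + backward[len(y) - j])
    left = hirschberg(x[:mid], y[:split])
    right = hirschberg(x[mid:], y[split:])
    return left[0] + right[0], left[1] + right[1]


if __name__ == "__main__":
    for row in hirschberg("AGTACGCA", "TATGC"):
        print(row)
```

Each level of the recursion recomputes scores over a total area of about half the previous level, which is where the roughly doubled running time comes from.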

Short review
- Letter a occurs independently with frequency $q_a$ in the random model.
- Aligned pairs of residues occur with a joint probability $p_{ab}$ in the match model.
- Random model: $P(x,y \mid R) = \prod_k q_{x_k} \prod_l q_{y_l}$
- Match model: $P(x,y \mid M) = \prod_k p_{x_k y_k}$

Given a pair of aligned sequences, we want to assign a score to the alignment that gives a measure of the relative likelihood that the sequences are related as opposed to being unrelated. We do this by having models that assign a probability to the alignment in each of the two cases; we then consider the ratio of the two probabilities.
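
Taking the logarithm of that ratio turns the products into a sum of per-position scores. A sketch of the step, assuming an ungapped alignment of two equal-length sequences so that both products in the random model run over the same index k; the symbols S and s(a,b) are introduced here for illustration:

$$
S = \log\frac{P(x,y \mid M)}{P(x,y \mid R)}
  = \log\prod_k \frac{p_{x_k y_k}}{q_{x_k}\, q_{y_k}}
  = \sum_k s(x_k, y_k),
\qquad
s(a,b) = \log\frac{p_{ab}}{q_a\, q_b}
$$

The values s(a,b) are the entries of a substitution score matrix.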

Bayesian approach: model comparison

Comparison
- For global matches, compare with 0 to determine whether the alignment is significant.
- When setting the prior odds ratio in inverse proportion to the size of the database N, compare with log N.
- For local matches, compare with 0.1 · log(nm).
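
One way to read these thresholds: adding the log prior odds to the log-odds score and passing the result through the logistic function gives the posterior probability of the match model, and the score at which the posterior reaches 0.5 is the threshold. A minimal sketch, assuming natural-log scores and a hypothetical function name:

```python
import math


def posterior_match_probability(score, prior_match):
    """P(M | x, y) from a natural-log odds score and a prior P(M)."""
    prior_log_odds = math.log(prior_match / (1.0 - prior_match))
    adjusted = score + prior_log_odds  # score shifted by the log prior odds
    return 1.0 / (1.0 + math.exp(-adjusted))


if __name__ == "__main__":
    n_database = 1000
    prior = 1.0 / n_database  # assumes one true match among N sequences
    for score in (0.0, math.log(n_database), 2 * math.log(n_database)):
        print(round(score, 2), posterior_match_probability(score, prior))
```

With a prior of 1/N the break-even score is log(N - 1), which is close to log N for a large database.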

Extreme value distribution
- Scores of a sequence aligned to a set of random sequences obey an extreme value distribution (EVD).
- We compute the probability that the best match of unrelated sequences has a score greater than our maximal score.

We assume a null model (the sequences are unrelated) and compute the probability that the match score with random sequences is higher than the observed score. A sum of many similar random variables tends to a normal distribution, but comparing a query sequence to a set of uniform length random sequences yields best scores that obey not a normal but an extreme value distribution.

What would happen to the significance if we used the normal distribution for comparison? The tail of the EVD is fatter, so assuming normality tends to grossly exaggerate an alignment’s significance.
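
The size of the effect can be illustrated by comparing the upper tail of a Gumbel-type EVD with that of a normal distribution that has the same mean and standard deviation. The parameters below (location 0, scale 1) are arbitrary illustrative choices, not values fitted to alignment scores:

```python
import math

EULER_GAMMA = 0.5772156649015329


def gumbel_tail(x, mu=0.0, beta=1.0):
    """P(X > x) for a Gumbel (maximum) extreme value distribution."""
    return -math.expm1(-math.exp(-(x - mu) / beta))


def normal_tail(x, mean, sd):
    """P(X > x) for a normal distribution."""
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2.0)))


def gumbel_mean_sd(mu=0.0, beta=1.0):
    """Mean and standard deviation of the Gumbel distribution."""
    return mu + EULER_GAMMA * beta, math.pi * beta / math.sqrt(6.0)


if __name__ == "__main__":
    mean, sd = gumbel_mean_sd()
    for k in (2, 3, 4):
        x = mean + k * sd
        # Columns: standard deviations above the mean, EVD tail, normal tail.
        print(k, gumbel_tail(x), normal_tail(x, mean, sd))
```

The further out in the tail the observed score lies, the more the normal approximation understates the chance of seeing it by accident.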

Other alignments
- For local ungapped alignments we have a different EVD than for fixed ungapped alignments, because there are more possible starting points.
- For gapped alignments, empirically established distributions are used.

Local ungapped: the number of random sequences with a match score greater than the observed value is Poisson distributed. We use the mean of this distribution to compute the appropriate EVD and follow the same principle. Gapped: the distribution has not been derived analytically.
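
For local ungapped alignments, the mean of that Poisson distribution is commonly written as E = K·m·n·exp(-λS), and the probability of at least one random match scoring above S is then 1 - exp(-E). A minimal sketch of the calculation; the values of K and λ used in the demonstration are placeholders, since the real values depend on the scoring system:

```python
import math


def expected_random_hits(score, m, n, K, lam):
    """Expected number of random matches scoring above `score`."""
    return K * m * n * math.exp(-lam * score)


def p_value(score, m, n, K, lam):
    """Probability of at least one random match scoring above `score`."""
    return -math.expm1(-expected_random_hits(score, m, n, K, lam))


if __name__ == "__main__":
    # Hypothetical parameters: query length m, database length n, K, lambda.
    m, n, K, lam = 300, 1_000_000, 0.1, 0.3
    for score in (50, 75, 100):
        print(score, expected_random_hits(score, m, n, K, lam),
              p_value(score, m, n, K, lam))
```

For small E the two quantities are nearly equal, which is why the expected count is often reported directly.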

Correcting for length
- When database sequences are longer, scores can be higher, because there are more possible starting points.
- Solutions:
  - Subtract log($m_i$) from the score, where $m_i$ is the length of the database sequence.
  - Bin all the database entries by length and fit a linear function through the means of the bins. This solution is better than the first and is easily done when the databases are large.
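
Both corrections are simple to sketch. The version below assumes that the linear function is fitted against the logarithm of the sequence length, which is one plausible reading of the slide; the function names and the bin width are hypothetical.

```python
import math
from collections import defaultdict


def length_corrected(score, length):
    """First solution: subtract the log of the database sequence length."""
    return score - math.log(length)


def fit_bin_means(scores, lengths, bin_width=100):
    """Second solution: bin by length, then least-squares fit through bin means.

    Returns (slope, intercept) of score as a linear function of log(length).
    """
    bins = defaultdict(list)
    for score, length in zip(scores, lengths):
        bins[length // bin_width].append((math.log(length), score))
    if len(bins) < 2:
        raise ValueError("need at least two length bins to fit a line")
    xs = [sum(x for x, _ in b) / len(b) for b in bins.values()]
    ys = [sum(y for _, y in b) / len(b) for b in bins.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    lengths = [50, 80, 150, 180, 250, 350]
    scores = [2.0 * math.log(m) + 3.0 for m in lengths]  # synthetic data
    print(fit_bin_means(scores, lengths))
```

A score can then be judged by how far it lies above the fitted line at its own sequence length.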

Notes on test statistic
- The search statistic is the same as the test statistic.
- Advantage: both have highly discriminative power.
- Disadvantage: introduction of bias in the test phase.

We might want to use uncorrelated statistics for the two phases, but each has its merits: the search statistic is good because it returns the alignment with the highest score, and the test statistic is good because it can compute the significance quite reliably. There is bias though; compare it with writing a program and testing it yourself.

Substitution and gap scores
- Letter a occurs independently with frequency $q_a$ in the random model.
- Aligned pairs of residues occur with a joint probability $p_{ab}$ in the match model.
- The gap score f(g) is a function of the length g of the gap.

The question now is how we estimate the probabilities we need to create the score matrices. This can be done in different ways.

Estimating probabilities
- Simple approach: set the probabilities to normalised frequencies, obtained by counting the frequencies of aligned residue pairs and of gaps in confirmed alignments.
- But:
  - It is difficult to obtain a good random sample: protein sequences come in families, so alignments tend not to be independent of each other.
  - It does not take into account different ‘distances’ to the common ancestor.

We need a way of estimating probabilities which takes into account evolutionary distance.

PAM matrices
- Percentage of Acceptable point Mutations per $10^8$ years matrices.
- Amino acid substitution matrices.
- Obtain substitution data from alignments and estimate probabilities for longer evolutionary distances.
- PAMn: n accepted mutation events per 100 amino acids.

PAM matrices (2)
- Construct phylogenetic trees relating the sequences in 71 families (at least 85% similar, so each pair of sequences differs by no more than 15% of their residues). A phylogenetic tree describes the changes a sequence goes through over time.
- Count the number of amino acid changes with respect to the immediate ancestor. A change a→b is counted the same as b→a, and the relative mutability of each amino acid (how much it changes) is assessed.
- A 20 x 20 amino acid substitution matrix is computed.
- The expected number of substitutions is 1% in PAM1.
- PAMn = (PAM1)^n.
- The PAM matrix is converted to a log-odds matrix: divide each substitution probability by the frequency of the amino acid and take the logarithm.
- PAM250 is the most widely used.
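
The extrapolation and the log-odds conversion can be sketched on a toy two-letter alphabet. Everything numeric here is invented for illustration: a real PAM1 matrix is 20 x 20 and estimated from data, and the helper names are hypothetical.

```python
import math


def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(matrix, n):
    """n-th power of a square matrix by repeated multiplication."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, matrix)
    return result


def log_odds(cond_prob, freqs):
    """s(a, b) = log(P(b | a) / q_b) for every pair of residues."""
    size = len(freqs)
    return [[math.log(cond_prob[a][b] / freqs[b]) for b in range(size)]
            for a in range(size)]


# Toy PAM1-like matrix: each row is P(b | a), with a 1% chance of substitution.
PAM1_TOY = [[0.99, 0.01],
            [0.01, 0.99]]
FREQS_TOY = [0.5, 0.5]

if __name__ == "__main__":
    pam250_toy = mat_power(PAM1_TOY, 250)
    for row in log_odds(pam250_toy, FREQS_TOY):
        print([round(value, 3) for value in row])
```

As n grows, each row of the powered matrix approaches the background frequencies and the log-odds scores shrink towards zero, reflecting how little signal is left at large evolutionary distances.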

Drawbacks
- Using the matrix for short time intervals to compute the ones for longer time intervals does not capture the true difference: short time intervals are dominated by amino acid substitutions that arise from single base changes.
- It takes into account only single base changes instead of all types of codon changes.

Databases containing alignments of more distantly related proteins are used to derive matrix scores more directly and accurately.

BLOCKS database
- A set of aligned ungapped regions from proteins.
- Used to derive BLOSUM matrices.
- Sequences are clustered according to percentage of identical residues.
- $A_{ab}$ then is the frequency of observing a in one cluster aligned to b in another cluster.
- The size of the clusters needs to be corrected for, to prevent bias.

BLOSUM
- BLOcks SUbstitution Matrix.
- BLOSUMn is the matrix where two sequences are put into one cluster when more than n% of their residues are identical (lower n corresponds to longer evolutionary time).
- From $A_{ab}$, $q_a$ and $p_{ab}$ are estimated, which are used to compute the scores for the matrix.
- $q_a$ is the fraction of pairings that include an a.
- $p_{ab}$ is the fraction of pairings between a and b out of all observed pairings.
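
The estimation step can be sketched from a table of pair counts. The counts and the function name below are made up for illustration, the table is assumed to be symmetric with no zero entries, and natural logarithms are used without the scaling and rounding applied to published matrices.

```python
import math


def blosum_style_scores(pair_counts):
    """Estimate q_a, p_ab and log-odds scores from symmetric pair counts A_ab."""
    residues = sorted(pair_counts)
    total = sum(pair_counts[a][b] for a in residues for b in residues)
    q = {a: sum(pair_counts[a][b] for b in residues) / total for a in residues}
    p = {(a, b): pair_counts[a][b] / total for a in residues for b in residues}
    scores = {(a, b): math.log(p[a, b] / (q[a] * q[b]))
              for a in residues for b in residues}
    return q, p, scores


if __name__ == "__main__":
    # Hypothetical counts for a two-letter alphabet.
    counts = {"a": {"a": 6, "b": 2},
              "b": {"a": 2, "b": 6}}
    q, p, scores = blosum_style_scores(counts)
    print(q)
    print({pair: round(value, 3) for pair, value in scores.items()})
```

Pairs seen more often than expected from the background frequencies get positive scores, and pairs seen less often get negative ones.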

PAM versus BLOSUM
- PAM:
  - Based on global alignments.
  - PAM1 is the matrix calculated from comparisons of substitutions in unit time.
  - Other PAM matrices are extrapolated from PAM1.
- BLOSUM:
  - Based on local alignments.
  - BLOSUMn is a matrix calculated from sequences clustered at n% identity, so the counted pairs come from sequences that are at most n% identical.
  - All matrices are based on observed alignments.

What is another difference? The higher the number in PAM, the more distantly related the sequences; in BLOSUM it is the other way around.

Gap penalties
- Time-dependent:
  - The number of gaps increases with time (gap-open score d linear in log t).
  - The length distribution stays constant (gap-extend score e remains constant).
- In practice people choose gap costs empirically (only two parameters).
- As gaps become more likely, we could reduce the pairwise scores.

There is no standard set of time-dependent gap models, so the assumptions above are reasonable ones. The last point is a very small correction and is not usually made.
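
With one gap-open and one gap-extend parameter, the score of a gap is an affine function of its length. A minimal sketch, where the function name and the default values of d and e are illustrative choices rather than recommended settings:

```python
def affine_gap_score(length, gap_open=12, gap_extend=2):
    """Score of a gap of `length` residues: -d - (length - 1) * e."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return -gap_open - (length - 1) * gap_extend


if __name__ == "__main__":
    for g in (1, 2, 5, 10):
        print(g, affine_gap_score(g))
```

Opening a gap is expensive and lengthening it is cheap, so one long gap is preferred over several short ones of the same total length.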

Notes
- The objective was to determine whether two sequences are related.
- Scoring schemes and statistics are used to determine the significance of a match.
- Even so, it is not always possible to distinguish between two sequences that are related and two sequences that seem to be related but are not.

Summary
- BLAST and FASTA packages are used to reduce the time used for finding alignments.
- Linear space alignments can be used to reduce memory usage.
- We need the significance of scores to judge the importance of a match.
- We can use the score parameters stated in PAM and BLOSUM matrices.