Basic idea for linear space: cells don't directly depend on cells more than one column to the left, so keep only two columns and forget the rest. But then there are no back-pointers; how do we find the alignment?
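To make the two-column trick concrete, here is a minimal Python sketch, assuming a simple scoring scheme (+1 match, -1 mismatch, -1 per gap character); the function name align_score and the parameters are ours, not from the slides:

```python
# Minimal sketch of the two-column trick (our naming; scoring scheme
# assumed here: +1 match, -1 mismatch, -1 per gap character).
def align_score(x, y, match=1, mismatch=-1, gap=-1):
    """Global alignment score of x and y in linear space."""
    prev = [i * gap for i in range(len(x) + 1)]      # column 0
    for j in range(1, len(y) + 1):
        cur = [j * gap]                              # top cell of column j
        for i in range(1, len(x) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur.append(max(prev[i - 1] + s,          # diagonal: x[i-1] ~ y[j-1]
                           prev[i] + gap,            # left: y[j-1] vs a gap
                           cur[i - 1] + gap))        # above: x[i-1] vs a gap
        prev = cur                                   # forget the old column
    return prev[len(x)]                              # score, but no alignment
```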
Divide and conquer: if we happen to know a cell on the optimal alignment, we can split the problem there into two smaller alignment problems.
We can then repeat this trick recursively on each half.
But how do we find such a cell in the first place?
Modified DP: determine where the optimal alignment crosses a chosen column (say the middle one) by remembering, at every cell, the row at which the best path to it crossed that column.
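A sketch of that modification, under the same assumed scoring and naming as align_score above; each cell carries, alongside its score, the row at which the best path to it sat in the middle column:

```python
def crossing_row(x, y, match=1, mismatch=-1, gap=-1):
    """Row at which an optimal alignment crosses column mid = len(y)//2.

    Same two-column DP as align_score, but every cell also carries the
    row where the best path to it sat in column mid. Base cases
    (len(y) < 2) are left to the surrounding recursion.
    """
    mid = len(y) // 2
    score = [i * gap for i in range(len(x) + 1)]
    cross = [0] * (len(x) + 1)                # meaningful only once j >= mid
    for j in range(1, len(y) + 1):
        new_score, new_cross = [j * gap], [cross[0]]   # top cell: from left
        for i in range(1, len(x) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            best, row = max(
                (score[i - 1] + s, cross[i - 1]),            # diagonal
                (score[i] + gap, cross[i]),                  # from the left
                (new_score[i - 1] + gap, new_cross[i - 1]))  # from above
            new_score.append(best)
            new_cross.append(row)
        score, cross = new_score, new_cross
        if j == mid:                          # in column mid, a path to row i
            cross = list(range(len(x) + 1))   # is, by definition, at row i
    return cross[len(x)]
```

The divide-and-conquer scheme then recurses on (x[:r], y[:mid]) and (x[r:], y[mid:]), where r is the returned row.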
Space analysis: we keep only two columns at a time, so the space is clearly O(n+m). But what about the time?
What are we doing? Using linear-space DP at every step of a divide-and-conquer scheme.
Time analysis: we do more work now, but how much? Look at the case of two identical strings of length n. The first pass over the whole table costs n² cells; the two subproblems of the next level together cost ½n²; the four after that cost 1/16 n² each, ¼n² in total; et cetera: n² + ½n² + ¼n² + … < 2n². Along the same lines, the algorithm in general is still O(nm), and actually only about twice as much work.
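In formulas: splitting an n-by-m table at the middle column, with optimal crossing row k, gives the recurrence

```latex
T(n,m) \;\le\; nm + T\!\left(k, \tfrac{m}{2}\right) + T\!\left(n-k, \tfrac{m}{2}\right)
       \;\le\; nm\left(1 + \tfrac{1}{2} + \tfrac{1}{4} + \dots\right) \;<\; 2nm,
```

since the subtables at each level of the recursion together cover half the area of the level above.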
Searching a database: given a database of strings and a query string, select the best-matching string(s) from the database.
Algorithms we already know: until now, the algorithms were exact and correct (for the model!), but for this purpose they are too slow. Say the database holds 100M strings of 1000 chars each, i.e. 10^11 cells to fill, and say we compute 10M cells per second: that leads to about 10^4 seconds, roughly 3 hours, of search time.
Getting faster: full DP is too expensive. The space issue is fixed, with a guaranteed identical result; now for the time issue.
BLAST and FASTA: heuristics to avoid doing the full DP. Not guaranteed to give the same result, but they tend to work well. The idea: first spend some time analyzing the strings, then calculate only some of the DP cells, based on that analysis.
Basic idea: in a good alignment, it is likely that several short parts match exactly. Example strings: A C C A B B D B C D C B B C B A and A B B A D A C C B B C C D C D A.
k-tuples: decompose strings into k-tuples with their corresponding offsets. E.g. with k=3, "A C C A B B" becomes 0: ACC, 1: CCA, 2: CAB, 3: ABB. Do this for both the database and the query, as sketched below.
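A one-line Python sketch of this decomposition (the name ktuples is ours):

```python
def ktuples(s, k=3):
    """All overlapping k-tuples of s, each paired with its offset."""
    return [(i, s[i:i + k]) for i in range(len(s) - k + 1)]

# ktuples("ACCABB") -> [(0, 'ACC'), (1, 'CCA'), (2, 'CAB'), (3, 'ABB')]
```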
Example strings (as before): A C C A B B D B C D C B B C B A and A B B A D A C C B B C C D C D A.
Matches / hits: plotting each hit by its offset in the query against its offset in the db string, lots of hits on the same diagonal might indicate a good alignment.
Don't do the full DP; do e.g. “banded DP” only around the diagonals with many matches, as in the sketch below.
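A sketch of the hit-finding step, assuming exact k-tuple matches: a hit at query offset i and db offset j lies on diagonal j - i, and we keep diagonals with at least min_hits hits (all names here are ours):

```python
from collections import Counter

def hot_diagonals(query, db_string, k=3, min_hits=2):
    """Diagonals (db offset minus query offset) with enough exact
    k-tuple matches to be worth a banded DP around them."""
    index = {}                                  # k-tuple -> query offsets
    for i in range(len(query) - k + 1):
        index.setdefault(query[i:i + k], []).append(i)
    hits = Counter()
    for j in range(len(db_string) - k + 1):
        for i in index.get(db_string[j:j + k], ()):
            hits[j - i] += 1                    # same diagonal: same j - i
    return sorted(d for d, c in hits.items() if c >= min_hits)
```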
Some options: if no diagonal has multiple matches, don't DP at all. Don't just allow exact k-tuple matches, but generate a ‘neighborhood’ of similar tuples for each query tuple (sketched below). …
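A brute-force sketch of neighborhood generation, assuming a substitution score table subst given as a dict keyed by character pairs (subst and the other names are our assumptions; real tools enumerate far more cleverly than scanning all of alphabet^k):

```python
from itertools import product

def neighborhood(tup, alphabet, subst, threshold):
    """All k-tuples whose substitution score against tup is at least
    threshold; these count as hits alongside exact matches."""
    return [''.join(w)
            for w in product(alphabet, repeat=len(tup))
            if sum(subst[a, b] for a, b in zip(tup, w)) >= threshold]
```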
Personal experience: a “Database architecture” practical assignment using MonetDB, a main-memory DBMS. SWISS-PROT was decomposed into 3-tuples: 150k strings, 150M 3-tuples. Task: find database strings with more than one match on the same diagonal.
Personal experience: a 43-char query string gave ~500k matches (in ~122k strings), with ~32k diagonals having more than one match, in ~25k strings. With some implementation effort: ~1s. (Kudos to Monet here!)
Personal experience: from 150k strings down to 15k ‘probable’ strings in 1 second. This discards 90% of the database for almost no work, and even gives extra information to speed up subsequent calculations … but it might discard otherwise good matches.
Personal experience: a tiny 6-char query gave 122 diagonals in 119 strings in no time at all; an actual protein from the database, a 250-char query, gave ~285k diagonals in ~99k strings in about 5 seconds.
BLAST/FASTA conclusion: not guaranteed to give the same result, but they tend to work well.
Statistical significance: what do the scores mean? And where do the score parameters (PAM, BLOSUM) come from?
What do the scores mean? We are calculating ‘optimal’ scores, but what do they mean? We used log-odds to get an additive scoring scheme. Is a match biologically meaningful, or just the best alignment between random strings? Two ways to answer: 1. the Bayesian approach, 2. the classical approach.
Bayesian approach: we are interested in the probability of a match given the strings, P(M | x,y). We already know the probability of the strings given the models, i.e. P(x,y | M) and P(x,y | R). So … Bayes' rule.
Bayesian approach: Bayes' rule gives us P(M | x,y) = P(x,y | M) P(M) / P(x,y), which we can now rewrite step by step.
Bayesian approach: rewriting yields P(M | x,y) = σ(S'), with S' = log(P(x,y|M) / P(x,y|R)) + log(P(M) / P(R)). The first term is exactly our alignment score!
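The rewrite steps spelled out: expand P(x,y) over the two models M (match) and R (random), divide through by P(x,y | R) P(R), and recognize the logistic function σ(t) = 1/(1 + e^(-t)):

```latex
P(M \mid x,y)
  = \frac{P(x,y \mid M)\,P(M)}{P(x,y \mid M)\,P(M) + P(x,y \mid R)\,P(R)}
  = \frac{e^{S'}}{e^{S'} + 1}
  = \sigma(S'),
\qquad
S' = \log\frac{P(x,y \mid M)}{P(x,y \mid R)} + \log\frac{P(M)}{P(R)}.
```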
Take care! This requires that the substitution matrix contains probabilities. Mind the prior probabilities: when matching against a database of N strings, subtract a log(N) term. And the alignment score is for the optimal alignment between the strings only; it ignores possible other good alignments.
Classical approach: call the maximum score among N random strings M_N. P(M_N < x) is then the probability that the best match from a search of a large number N of unrelated sequences has score lower than x; it follows an extreme value distribution. Consider x = our score: if this probability is very large, it is likely that our match was not just random.
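The slide leaves the distribution's form unspecified; the standard extreme-value (Gumbel) shape, with parameters K and λ that depend on the scoring system (our addition, not from the slide), is:

```latex
P(M_N < x) \;\approx\; \exp\!\bigl(-K N\, e^{-\lambda x}\bigr).
```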
Correcting for length: the score is additive, so longer strings get higher scores!
Correcting for length: if a match with any string should a priori be equally likely, correct for this bias by subtracting log(length) from the score. Because the score is a log-odds, this ‘is a division’ that normalizes the score.
Scoring parameters: how do we get the values in the cost matrices? 1. Just count frequencies? 2. PAM. 3. BLOSUM.
Just count frequencies? That would be the maximum likelihood estimate, but it needs lots of confirmed alignments, and those show different amounts of divergence.
PAM: proteins were grouped by ‘family’ into a phylogenetic tree. The PAM1 matrix contains substitution probabilities for 1 unit of evolutionary time; PAMn is then (PAM1)^n, and PAM250 is often used (see the sketch below).
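A minimal NumPy sketch of the PAMn construction, given some PAM1 probability matrix pam1 (variable names are ours):

```python
import numpy as np

def pam_n(pam1, n):
    """PAMn as the n-th matrix power of the PAM1 probability matrix."""
    return np.linalg.matrix_power(pam1, n)

# e.g. pam250 = pam_n(pam1, 250); the scoring matrix is then the
# log-odds of these probabilities against background frequencies.
```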
BLOSUM: long-term PAM matrices are inaccurate, because the inaccuracies in PAM1 multiply, and because there are actual differences between short-term and long-term changes. Different BLOSUM matrices are determined directly for different levels of divergence, which solves both problems.
Gap penalties: there is no proper time-dependent model, but it seems reasonable that the expected number of gaps is linear in time, while the length distribution of gaps stays constant.